Problem Statement: Humphrey et al. (2013) conducted time-course phosphoproteome profiling of insulin-stimulated fat cells (3T3-L1 adipocytes). The resulting data can be used to uncover unknown aspects of insulin signalling pathways, which is important for the treatment of type II diabetes. Akt and mTOR are believed to be central kinases in insulin signalling in fat cells. Kinases regulate their substrates by recognising a peptide sequence motif in the substrate, and substrates of the same kinase often show similar response profiles. This similarity in response profiles can be leveraged to predict novel substrates of Akt and mTOR by learning features from the temporal phosphoproteomics data.
Aim: Prediction is a central application in many real-world data analyses. In this project, we aim to apply classification techniques to predict novel kinase substrates.
Objective: Design a predictive model for identifying novel Akt and mTOR substrates using the InsulinPhospho.txt, AUC_Ins.txt, Akt_substrates.txt and mTOR_substrates.txt datasets.
library(ggplot2)
library(mlbench)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
library(scatterplot3d)
library(MASS)
library(gbm)
## Loaded gbm 2.1.4
library(class)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
library(mice)
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(tidyverse)
## -- Attaching packages ------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ---------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::combine() masks randomForest::combine()
## x tidyr::complete() masks mice::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
## x dplyr::select() masks MASS::select()
## x xgboost::slice() masks dplyr::slice()
## x Hmisc::src() masks dplyr::src()
## x Hmisc::summarize() masks dplyr::summarize()
library(corrplot)
## corrplot 0.84 loaded
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(stringr)
library(reshape)
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
## The following object is masked from 'package:dplyr':
##
## rename
## The following object is masked from 'package:class':
##
## condense
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
##
## rename
## The following object is masked from 'package:xgboost':
##
## slice
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(caret)
source("3_RebuiltAdasamplingFunctions.R")
source("functions.R")
source("3_RebuiltAdasamplingFunctions_SVM.R")
InsulinPhospho <- read.csv("./Data2/Datasets/InsulinPhospho.txt", sep="\t") # Main temporal phosphoproteomics data.
AUC_Ins <- read.csv("./Data2/Datasets/AUC_Ins.txt", sep="\t") # Secondary temporal phosphoproteomics data.
Akt_Substrates <- read.csv("./Data2/Datasets/Akt_substrates.txt", header = F) # Identifiers of known Akt substrates.
mTOR_Substrates <- read.csv("./Data2/Datasets/mTOR_substrates.txt", header = F) # Identifiers of known mTOR substrates.
# Name the column of Akt and mTOR dataframes.
names(Akt_Substrates) <- "Identifier"
names(mTOR_Substrates) <- "Identifier"
# Summary of InsulinPhospho
summary(InsulinPhospho)
## Identifier Seq.Window Avg.Fold
## 100043914;88; : 1 EEQSSASVKKDET: 3 Min. :-3.57566
## 100043914;89; : 1 EMKNEKSEEEQSS: 3 1st Qu.:-0.10775
## 1110018J18RIK;18; : 1 GPEDPKSQVGPED: 3 Median : 0.05211
## 1110037F02RIK;133; : 1 KSEEEQSSASVKK: 3 Mean : 0.18380
## 1110037F02RIK;138; : 1 _____MSETAPAA: 2 3rd Qu.: 0.31962
## 1110037F02RIK;1628;: 1 AAAAGDSDSWDAD: 2 Max. : 4.66251
## (Other) :12056 (Other) :12046
## AUC X15s X30s
## Min. :0.1267 Min. :-6.43115 Min. :-6.048357
## 1st Qu.:0.4210 1st Qu.:-0.16701 1st Qu.:-0.162779
## Median :0.5118 Median : 0.01071 Median : 0.005617
## Mean :0.5080 Mean : 0.02670 Mean : 0.030709
## 3rd Qu.:0.5936 3rd Qu.: 0.20577 3rd Qu.: 0.190593
## Max. :0.8994 Max. : 8.09584 Max. : 5.675916
##
## X1m X2m X5m
## Min. :-4.23629 Min. :-6.50937 Min. :-5.35020
## 1st Qu.:-0.16813 1st Qu.:-0.15529 1st Qu.:-0.14475
## Median : 0.01723 Median : 0.03757 Median : 0.04856
## Mean : 0.05697 Mean : 0.10266 Mean : 0.23983
## 3rd Qu.: 0.20631 3rd Qu.: 0.26007 3rd Qu.: 0.38010
## Max. : 6.52259 Max. : 5.92652 Max. : 6.29936
##
## X10m X20m X60m
## Min. :-4.48161 Min. :-3.80870 Min. :-4.99095
## 1st Qu.:-0.16683 1st Qu.:-0.18933 1st Qu.:-0.20728
## Median : 0.06984 Median : 0.06777 Median : 0.08565
## Mean : 0.32615 Mean : 0.34238 Mean : 0.34501
## 3rd Qu.: 0.51104 3rd Qu.: 0.55507 3rd Qu.: 0.64820
## Max. : 9.68638 Max. : 7.82420 Max. : 9.17293
##
## Ins.1 LY Ins.2
## Min. :-9.2126 Min. :-5.931833 Min. :-9.48170
## 1st Qu.:-0.1994 1st Qu.:-0.295927 1st Qu.:-0.20931
## Median : 0.1000 Median :-0.004296 Median : 0.08076
## Mean : 0.2108 Mean : 0.039761 Mean : 0.23615
## 3rd Qu.: 0.4654 3rd Qu.: 0.294982 3rd Qu.: 0.50019
## Max. : 5.6466 Max. : 4.961114 Max. : 7.41844
##
## MK
## Min. :-5.09015
## 1st Qu.:-0.15985
## Median : 0.03766
## Mean : 0.19743
## 3rd Qu.: 0.34691
## Max. : 7.65471
##
describe(InsulinPhospho)
## InsulinPhospho
##
## 16 Variables 12062 Observations
## ---------------------------------------------------------------------------
## Identifier
## n missing distinct
## 12062 0 12062
##
## lowest : 100043914;88; 100043914;89; 1110018J18RIK;18; 1110037F02RIK;133; 1110037F02RIK;138;
## highest: ZNRF2;73; ZNRF2;75; ZW10;438; ZYX;272; ZYX;336;
## ---------------------------------------------------------------------------
## Seq.Window
## n missing distinct
## 12062 0 11983
##
## lowest : _____MSAGGDFG _____MSAGSSCS _____MSASAGGS _____MSASSLLE _____MSDFDEFE
## highest: YYGRDRSPLRRNA YYLRKDSLSMGVS YYRGSVSLNSSPK YYRRTHSDASDDE YYTSPNSPTSSSR
## ---------------------------------------------------------------------------
## Avg.Fold
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 11985 1 0.1838 0.573 -0.45187 -0.28681
## .25 .50 .75 .90 .95
## -0.10775 0.05211 0.31962 0.85661 1.36784
##
## lowest : -3.575662 -2.982594 -2.875546 -2.719204 -2.564780
## highest: 4.450629 4.505098 4.505757 4.510543 4.662506
## ---------------------------------------------------------------------------
## AUC
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 11985 1 0.508 0.1422 0.2949 0.3419
## .25 .50 .75 .90 .95
## 0.4210 0.5118 0.5936 0.6712 0.7128
##
## lowest : 0.1267419 0.1354366 0.1387156 0.1533232 0.1561299
## highest: 0.8607463 0.8620492 0.8627747 0.8723514 0.8993557
## ---------------------------------------------------------------------------
## X15s
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10850 1 0.0267 0.4982 -0.66341 -0.42668
## .25 .50 .75 .90 .95
## -0.16701 0.01071 0.20577 0.50007 0.76590
##
## lowest : -6.431152 -5.707707 -4.708113 -4.679189 -4.351233
## highest: 5.861646 5.912577 6.243705 6.505302 8.095836
## ---------------------------------------------------------------------------
## X30s
## n missing distinct Info Mean Gmd .05
## 12062 0 10701 1 0.03071 0.4819 -0.642325
## .10 .25 .50 .75 .90 .95
## -0.398556 -0.162779 0.005617 0.190593 0.464509 0.753963
##
## lowest : -6.048357 -4.580469 -4.344057 -4.257939 -4.008610
## highest: 4.820007 4.864614 5.008597 5.385148 5.675916
## ---------------------------------------------------------------------------
## X1m
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10821 1 0.05697 0.5354 -0.67915 -0.42421
## .25 .50 .75 .90 .95
## -0.16813 0.01723 0.20631 0.53875 0.86327
##
## lowest : -4.236292 -3.789843 -3.746004 -3.735446 -3.584690
## highest: 4.941939 4.967731 5.074889 5.315292 6.522594
## ---------------------------------------------------------------------------
## X2m
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10931 1 0.1027 0.5753 -0.62656 -0.41236
## .25 .50 .75 .90 .95
## -0.15529 0.03757 0.26007 0.62645 1.03597
##
## lowest : -6.509371 -4.483367 -3.834470 -3.639205 -3.570564
## highest: 5.113640 5.262066 5.840508 5.884139 5.926518
## ---------------------------------------------------------------------------
## X5m
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10937 1 0.2398 0.7696 -0.60440 -0.36966
## .25 .50 .75 .90 .95
## -0.14475 0.04856 0.38010 1.17006 1.94407
##
## lowest : -5.350203 -4.232269 -4.044319 -3.620915 -3.302839
## highest: 4.917470 5.133200 5.157161 5.238157 6.299364
## ---------------------------------------------------------------------------
## X10m
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10941 1 0.3261 0.9349 -0.66801 -0.41619
## .25 .50 .75 .90 .95
## -0.16683 0.06984 0.51104 1.53571 2.46632
##
## lowest : -4.481613 -3.594390 -3.531200 -3.505036 -3.184994
## highest: 5.552755 5.618058 6.025959 6.273357 9.686378
## ---------------------------------------------------------------------------
## X20m
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10974 1 0.3424 1.012 -0.73443 -0.48258
## .25 .50 .75 .90 .95
## -0.18933 0.06777 0.55507 1.72995 2.62390
##
## lowest : -3.808696 -3.764977 -3.680030 -3.664272 -3.410112
## highest: 5.881314 6.059930 6.373999 6.867712 7.824198
## ---------------------------------------------------------------------------
## X60m
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 11095 1 0.345 1.084 -0.93517 -0.57109
## .25 .50 .75 .90 .95
## -0.20728 0.08565 0.64820 1.82106 2.59184
##
## lowest : -4.990950 -4.282663 -4.127868 -3.973265 -3.874971
## highest: 6.081722 6.281500 7.917290 9.063885 9.172934
## ---------------------------------------------------------------------------
## Ins.1
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10715 1 0.2108 0.9678 -1.1473 -0.6703
## .25 .50 .75 .90 .95
## -0.1994 0.1000 0.4654 1.4370 2.0989
##
## lowest : -9.212608 -9.069396 -8.781314 -8.731222 -7.243537
## highest: 4.602209 4.663631 4.703433 5.477451 5.646595
## ---------------------------------------------------------------------------
## LY
## n missing distinct Info Mean Gmd .05
## 12062 0 10745 1 0.03976 0.7696 -1.076422
## .10 .25 .50 .75 .90 .95
## -0.720520 -0.295927 -0.004296 0.294982 0.897364 1.428507
##
## lowest : -5.931833 -5.775406 -4.270598 -4.156807 -4.122705
## highest: 4.095418 4.101230 4.183724 4.305606 4.961114
## ---------------------------------------------------------------------------
## Ins.2
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10857 1 0.2362 1.016 -1.15844 -0.66518
## .25 .50 .75 .90 .95
## -0.20931 0.08076 0.50019 1.60192 2.27527
##
## lowest : -9.481698 -8.560064 -5.663465 -5.433093 -5.309860
## highest: 4.767779 4.818722 4.848097 5.133769 7.418443
## ---------------------------------------------------------------------------
## MK
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 10706 1 0.1974 0.7927 -0.75905 -0.47456
## .25 .50 .75 .90 .95
## -0.15985 0.03766 0.34691 1.19754 1.97925
##
## lowest : -5.090152 -4.707793 -4.466225 -4.301273 -4.295086
## highest: 4.421762 4.516157 4.576791 4.845139 7.654708
## ---------------------------------------------------------------------------
dim(InsulinPhospho)
## [1] 12062 16
# Summary of AUC_Ins
summary(AUC_Ins)
## Identifier Seq.Window AUC
## 100043914;88; : 1 EEQSSASVKKDET: 3 Min. :0.1267
## 100043914;89; : 1 EMKNEKSEEEQSS: 3 1st Qu.:0.4210
## 1110018J18RIK;18; : 1 GPEDPKSQVGPED: 3 Median :0.5118
## 1110037F02RIK;133; : 1 KSEEEQSSASVKK: 3 Mean :0.5080
## 1110037F02RIK;138; : 1 _____MSETAPAA: 2 3rd Qu.:0.5936
## 1110037F02RIK;1628;: 1 AAAAGDSDSWDAD: 2 Max. :0.8994
## (Other) :12056 (Other) :12046
describe(AUC_Ins)
## AUC_Ins
##
## 3 Variables 12062 Observations
## ---------------------------------------------------------------------------
## Identifier
## n missing distinct
## 12062 0 12062
##
## lowest : 100043914;88; 100043914;89; 1110018J18RIK;18; 1110037F02RIK;133; 1110037F02RIK;138;
## highest: ZNRF2;73; ZNRF2;75; ZW10;438; ZYX;272; ZYX;336;
## ---------------------------------------------------------------------------
## Seq.Window
## n missing distinct
## 12062 0 11983
##
## lowest : _____MSAGGDFG _____MSAGSSCS _____MSASAGGS _____MSASSLLE _____MSDFDEFE
## highest: YYGRDRSPLRRNA YYLRKDSLSMGVS YYRGSVSLNSSPK YYRRTHSDASDDE YYTSPNSPTSSSR
## ---------------------------------------------------------------------------
## AUC
## n missing distinct Info Mean Gmd .05 .10
## 12062 0 11985 1 0.508 0.1422 0.2949 0.3419
## .25 .50 .75 .90 .95
## 0.4210 0.5118 0.5936 0.6712 0.7128
##
## lowest : 0.1267419 0.1354366 0.1387156 0.1533232 0.1561299
## highest: 0.8607463 0.8620492 0.8627747 0.8723514 0.8993557
## ---------------------------------------------------------------------------
dim(AUC_Ins)
## [1] 12062 3
# Summary of Akt_Substrates
summary(Akt_Substrates)
## Identifier
## AKT1S1;318;: 1
## AS160;324; : 1
## AS160;595; : 1
## AS160;649; : 1
## BAD;136; : 1
## BRF1;92; : 1
## (Other) :16
describe(Akt_Substrates)
## Akt_Substrates
##
## 1 Variables 22 Observations
## ---------------------------------------------------------------------------
## Identifier
## n missing distinct
## 22 0 22
##
## lowest : AKT1S1;318; AS160;324; AS160;595; AS160;649; BAD;136;
## highest: PFKFB2;486; RANBP3;58; TSC2;1466; TSC2;939; TSC2;981;
## ---------------------------------------------------------------------------
dim(Akt_Substrates)
## [1] 22 1
# Summary of mTOR_Substrates
summary(mTOR_Substrates)
## Identifier
## AHNAK;5536; : 1
## AKT1S1;255; : 1
## ANKRD17;2041;: 1
## APRF;727; : 1
## DEPDC6;265; : 1
## DEPDC6;293; : 1
## (Other) :20
describe(mTOR_Substrates)
## mTOR_Substrates
##
## 1 Variables 26 Observations
## ---------------------------------------------------------------------------
## Identifier
## n missing distinct
## 26 0 26
##
## lowest : AHNAK;5536; AKT1S1;255; ANKRD17;2041; APRF;727; DEPDC6;265;
## highest: RPS6KB1;427; RPS6KB1;444; RPS6KB1;452; TBC1D10B;693; ULK1;763;
## ---------------------------------------------------------------------------
dim(mTOR_Substrates)
## [1] 26 1
Both temporal phosphoproteomics dataframes contain the same number of rows (samples) but different numbers of columns (features). Check whether the secondary temporal phosphoproteomics dataframe (AUC_Ins) contains any data not already present in InsulinPhospho.
test <- AUC_Ins %>% merge(InsulinPhospho, by="Identifier") # Merge the dataframes using their unique identifier to join on.
# Create a dataframe where AUC_Ins features = InsulinPhospho features.
test.result <- test %>% dplyr::select(Identifier, Seq.Window.x, Seq.Window.y, AUC.x, AUC.y) %>% filter(Seq.Window.x == Seq.Window.y | AUC.x == AUC.y)
head(test.result)
## Identifier Seq.Window.x Seq.Window.y AUC.x AUC.y
## 1 100043914;88; GRHERRSSAEQHS GRHERRSSAEQHS 0.2967288 0.2967288
## 2 100043914;89; RHERRSSAEQHSE RHERRSSAEQHSE 0.5279128 0.5279128
## 3 1110018J18RIK;18; CAGRVPSPAGSVT CAGRVPSPAGSVT 0.5240430 0.5240430
## 4 1110037F02RIK;133; SVDRVISHDRDSP SVDRVISHDRDSP 0.6480379 0.6480379
## 5 1110037F02RIK;138; ISHDRDSPPPPPP ISHDRDSPPPPPP 0.5002719 0.5002719
## 6 1110037F02RIK;1628; FLSEPSSPGRSKT FLSEPSSPGRSKT 0.4370191 0.4370191
write.csv(test.result, file = "./Output/AUC_Ins_test_result.csv")
dim(test.result) # test.result has 12,062 rows (the same number of rows as InsulinPhospho), so every AUC_Ins row matches an InsulinPhospho row: AUC_Ins contains no new data.
## [1] 12062 5
rm(test.result) # Drop the test dataframe since it is no longer needed.
# There is no need to consider data from AUC_Ins because it is contained within InsulinPhospho.
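The redundancy check used above can be sketched on toy data. This is a minimal illustration of the merge-and-compare technique, with hypothetical dataframes `a` and `b` standing in for InsulinPhospho and AUC_Ins:

```r
# Toy sketch of the redundancy check: join two frames on a shared key and
# compare the duplicated column. The data here are hypothetical.
a <- data.frame(Identifier = c("P1", "P2"), AUC = c(0.3, 0.5))
b <- data.frame(Identifier = c("P1", "P2"), AUC = c(0.3, 0.5), Extra = c(1, 2))

m <- merge(a, b, by = "Identifier")   # duplicate names become AUC.x / AUC.y
redundant <- all(m$AUC.x == m$AUC.y)  # TRUE when b simply repeats a's values
redundant
```

When the shared columns agree row-for-row, as they do for AUC_Ins, the secondary frame adds nothing to the analysis.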
Identifiers for known Akt and mTOR substrates are provided in the Akt_substrates and mTOR_substrates datasets. Append substrate-label columns to InsulinPhospho to create a master dataset.
dat <- InsulinPhospho #Name master data set.
# Add Akt Substrates to dat
dat$aktSubstrate <- 0 # Add new column of zeros to dat for aktSubstrate
dat$aktSubstrate[dat$Identifier %in% Akt_Substrates$Identifier] <- 1 # Label rows with known AktSubstrates with a 1.
dat$aktSubstrate <- as.factor(dat$aktSubstrate) # Change labels to factors.
dat %>% filter(aktSubstrate == 1) # There should be 22 Akt Substrates. Check to make sure this is the case.
## Identifier Seq.Window Avg.Fold AUC X15s X30s
## 1 AKT1S1;318; PRPRLNTSDFQKL 3.5063686 0.2030335 1.79324937 2.9496267
## 2 AS160;324; FRSRCSSVTGVMQ 2.6444937 0.2217749 1.09004613 2.4160883
## 3 AS160;595; MRGRLGSMDSFER 3.5000308 0.2449809 1.14603323 2.7169587
## 4 AS160;649; FRRRAHTFSHPPS 2.7125564 0.3436261 1.86233676 2.5158617
## 5 BAD;136; FRGRSRSAPPNLW 1.1134673 0.3063216 0.26743636 0.6608665
## 6 BRF1;92; FRDRSFSEGGERL 2.0512883 0.3971139 0.31345060 1.6545300
## 7 CARHSP1;53; RRTRTFSATVRAS 0.6102895 0.3393443 0.10286623 0.4214935
## 8 ECNOS;1176; SRIRTQSFSLQER 3.3036490 0.3120402 1.95858851 2.9400282
## 9 EDC3;161; FRRRHNSWSSSSR 0.7890872 0.5122196 0.17997063 0.3818158
## 10 FKHR2;252; PRRRAVSMDNSNK 3.3713432 0.2581184 1.79702760 2.7233798
## 11 FOXO1;316; FRPRTSSNASTIS 2.0303375 0.3183012 1.36537800 2.1674276
## 12 GSK3A;21; GRARTSSFAEPGG 1.8109899 0.2360954 1.05994278 1.7976242
## 13 GSK3B;9; GRPRTTSFAESCK 2.2435031 0.2317606 2.07312529 2.0480453
## 14 IRS1;265; FRPRSKSQSSSSC 2.4203834 0.1874690 1.80474628 2.2272248
## 15 IRS1;522; FRKRTHSAGTSPT 3.5017097 0.2202295 3.18479801 3.7653269
## 16 KIAA1272;715; MRFRSATTSGAPG 3.6315602 0.2045568 2.90029590 3.9743589
## 17 PFKFB2;469; VRMRRNSFTPLSS 2.1497783 0.2405669 -0.61652987 2.3467934
## 18 PFKFB2;486; RRPRNYSVGSRPL 2.4928085 0.3131412 1.04467800 1.9926449
## 19 RANBP3;58; KRERTSSLTHSEE 1.4981902 0.4456912 0.02024329 0.6831158
## 20 TSC2;1466; LRPRGYTISDSAP 2.2595063 0.3103911 1.09374609 2.2927237
## 21 TSC2;939; FRARSTSLNERPK 1.7348112 0.2579600 0.85456623 1.6825158
## 22 TSC2;981; FRCRSISVSEHVV 2.0218485 0.2256770 0.64320312 1.7322447
## X1m X2m X5m X10m X20m X60m Ins.1
## 1 3.6929821 3.6410960 4.0850147 3.8988759 3.9781234 4.0119808 2.4435006
## 2 2.7808397 2.8150416 3.1520775 2.9641961 2.8741745 3.0634861 2.5618882
## 3 3.8662323 3.8460953 4.1608367 4.3124341 4.0466843 3.9049719 3.0435020
## 4 3.8968884 3.3103308 2.7608542 3.2505217 1.6278092 2.4758485 1.9556848
## 5 1.0895884 1.3932881 1.1977503 1.4687109 1.4838691 1.3462286 0.8771160
## 6 1.8278517 2.3240815 2.0455537 2.1632451 2.9987398 3.0828541 1.5203971
## 7 0.6383167 0.6397195 0.7636132 0.8609022 0.7909329 0.6644715 0.5115692
## 8 3.5337725 4.3957991 4.5197697 2.9879681 2.9856010 3.1076649 2.7818330
## 9 0.6641306 0.7669697 0.8166649 1.1021829 1.5026015 0.8983615 0.8567865
## 10 3.7485577 4.2097757 3.6186715 3.4755719 3.4267511 3.9710105 2.8694427
## 11 2.4548910 2.8385740 2.4755004 1.6673830 1.7489870 1.5245591 1.5571069
## 12 2.0253917 2.2009603 1.9187483 1.8104378 1.6001571 2.0746572 1.8268434
## 13 2.5860696 2.7226190 2.2595069 2.1286849 1.6998950 2.4300785 1.6249572
## 14 2.2860361 2.3195207 2.5772457 2.7773139 2.7513025 2.6196770 2.0322772
## 15 4.2802085 3.8039699 3.2472222 3.6922333 3.4138487 2.6260700 2.6677332
## 16 3.7084127 4.2649214 3.5941510 2.9888252 3.7965945 3.8249220 3.0211066
## 17 2.0548841 2.6590394 2.7341940 2.6148573 2.8127280 2.5922604 2.1981597
## 18 2.5012813 2.5750736 2.7087517 2.7991921 3.3598561 2.9609903 2.1681606
## 19 1.2336567 1.8135842 2.0102513 2.4843846 1.8031033 1.9371825 1.3780121
## 20 1.8103598 2.4672887 2.3265719 3.0254502 2.2898481 2.7700618 2.2456478
## 21 1.9670307 1.7871966 1.6364653 1.9876280 1.8068095 2.1562772 1.7609448
## 22 2.2642534 2.4280360 2.3883727 2.3669369 2.0835106 2.2682308 1.7042523
## LY Ins.2 MK aktSubstrate
## 1 0.90188056 2.6152986 -0.84959630 1
## 2 0.35895883 2.6616383 -0.60092737 1
## 3 2.08987098 4.2587563 1.99019248 1
## 4 1.35371850 3.2326608 -0.77850472 1
## 5 0.41565051 1.0552651 -0.27522751 1
## 6 0.64247054 2.7904801 1.29519404 1
## 7 -0.09615960 0.5439719 0.35782560 1
## 8 -0.02538751 2.5012608 0.63932479 1
## 9 0.03040671 0.6190486 0.33319472 1
## 10 1.27535861 3.2895524 0.06694804 1
## 11 0.60681582 1.9505807 0.69546036 1
## 12 0.76553475 1.9555741 0.74038355 1
## 13 0.72180764 1.8650874 1.20263838 1
## 14 0.81532949 2.1558464 1.47421174 1
## 15 2.06767313 3.5093158 2.23333590 1
## 16 1.05580714 1.5761317 -1.17459827 1
## 17 1.27485624 0.7385128 1.58090204 1
## 18 1.22817222 2.1286233 1.56554502 1
## 19 0.70531444 1.5717253 1.21182212 1
## 20 1.03393391 2.5180929 -0.49254328 1
## 21 0.75034924 1.4731697 -0.80007329 1
## 22 0.60131618 1.9982121 0.19181529 1
# Repeating the same process as above for mTOR substrates:
dat$mTORSubstrate <- 0
dat$mTORSubstrate[dat$Identifier %in% mTOR_Substrates$Identifier] <- 1
dat %>% filter(mTORSubstrate == 1) # There should be 26 mTOR substrates. Check to make sure this is the case.
## Identifier Seq.Window Avg.Fold AUC X15s
## 1 AHNAK;5536; VEAEASSPKGKFS 0.42633446 0.6567462 -0.176054689
## 2 AKT1S1;255; TQQYAKSLPVSVP 0.98999183 0.4360069 -0.007265314
## 3 ANKRD17;2041; PVSSPSSPSPPAQ 0.64920432 0.5362401 0.009309093
## 4 APRF;727; TIDLPMSPRTLDS 0.55215123 0.5233701 -0.426227432
## 5 DEPDC6;265; TSFMSVSPSKEIK 0.51836151 0.6913704 0.019358169
## 6 DEPDC6;293; SGYFSSSPTLSSS 0.69685396 0.6831728 0.541223454
## 7 DEPDC6;299; SPTLSSSPPVLCN 0.40443692 0.4907117 0.432595528
## 8 EIF4EBP1;36; PGDYSTTPGGTLF 0.13496319 0.4781614 0.053892242
## 9 EIF4EBP1;64; LMECRNSPVAKTP 0.71742741 0.5722157 -0.065138500
## 10 EIF4EBP2;37; PQDYCTTPGGTLF 0.43604513 0.5058809 -0.182937288
## 11 EIF4EBP2;46; GTLFSTTPGGTRI 0.17340754 0.3716139 0.232247973
## 12 FRAP;2481; VPESIHSFIGDGL 0.56419310 0.4557192 0.157769811
## 13 GRB10;503; NILSSQSPLHPST 0.81454942 0.5174816 -0.751905100
## 14 IRS1;632; GDYMPMSPKSVSA 1.49906522 0.5938386 -0.030464601
## 15 MAF1;60; HVLEALSPPQTSG 1.19491669 0.5144249 -0.670922080
## 16 MAF1;68; PQTSGLSPSRLSK 1.00365258 0.5403878 -0.555997216
## 17 NUMA1;1982; THQGPGTPESKKA 0.83731208 0.6926504 -0.587028000
## 18 PATL1;184; TSPIIGSPPVRAV 1.03362489 0.5261557 0.128700234
## 19 PATL1;194; RAVPIGTPPKQMA 1.00472797 0.5992069 0.436770044
## 20 RAPTOR;863; TQSAPASPTNKGM 0.84801251 0.5054830 0.120069906
## 21 RAPTOR;877; MHQVGGSPPASST -0.00532214 0.6726739 -0.191369510
## 22 RPS6KB1;427; SVKEKFSFEPKIR 2.23146956 0.5458955 0.084781349
## 23 RPS6KB1;444; FIGSPRTPVSPVK 1.82587943 0.5360684 0.590065853
## 24 RPS6KB1;452; VSPVKFSPGDFWG 2.23331222 0.5497894 2.291144987
## 25 TBC1D10B;693; LHPSLPSPTGNST 0.80164891 0.5476402 0.144077311
## 26 ULK1;763; VVFTVGSPPSGAT 0.72853856 0.7018710 0.533573344
## X30s X1m X2m X5m X10m X20m
## 1 -0.217390655 -0.334683918 -0.44519263 0.25533667 1.0464213 1.5109363
## 2 0.160027912 0.547349427 1.07587947 1.50442263 1.5013697 1.5893041
## 3 -0.015606857 0.207165706 0.34081470 1.00575082 1.2573407 1.1973539
## 4 -0.964410775 -0.194466781 -0.24003503 1.39900189 1.4489483 2.0398072
## 5 -0.156930912 -0.134111736 0.33140181 0.57627243 0.7510608 1.0707849
## 6 0.067775694 0.409180800 0.53987771 1.02175763 2.0597414 -0.0228508
## 7 -0.070780177 0.186650146 0.23696274 1.20723293 0.3657260 1.8393274
## 8 -0.090965206 -0.008004236 0.16407134 0.17895159 0.2498839 0.2264721
## 9 -0.044541666 0.111832061 0.30066749 1.08497803 1.5636114 1.4150152
## 10 0.191076439 1.019805586 0.50131281 0.46518926 0.5337301 0.5652371
## 11 0.082650222 -0.208825550 0.47783821 -1.17673824 0.5174982 0.5786523
## 12 0.061129258 -0.142141769 0.41436027 1.01980520 1.0373473 0.9354886
## 13 -0.118098662 -0.001528758 0.06968892 1.36414164 1.6279706 2.2219585
## 14 -0.053877677 0.009467811 0.72018799 2.30840705 2.8996614 3.3386457
## 15 -0.564868134 0.178840508 0.62393093 2.20634190 2.6196297 2.8769290
## 16 -0.200317130 0.256986250 1.17604053 1.22469445 1.9409463 1.6891156
## 17 -0.643446989 -0.491875735 -0.76117779 -0.08562473 1.9657848 3.7091163
## 18 0.010209789 -0.182544297 0.50174870 1.93246904 1.8129093 2.1285754
## 19 0.016577667 0.088870846 0.37768135 0.89375770 2.1781633 1.9382400
## 20 0.075364731 0.323780019 0.55571219 1.40396490 1.5468006 1.4289157
## 21 -0.202757468 -0.099480367 -0.05867325 -0.12431169 0.1100442 0.1877505
## 22 0.512462410 0.177735972 0.68961107 3.59354117 4.2936110 4.1805202
## 23 0.566495011 0.411982801 0.87967942 3.40790889 2.4611260 3.5691473
## 24 0.397956875 0.243112184 0.25134133 2.25508225 3.8001843 4.2717838
## 25 -0.684524590 -0.218289064 -0.03673607 1.26383575 1.8261929 1.8335039
## 26 -0.008488011 -0.263439870 -0.43188586 0.38466278 0.6540405 2.0991027
## X60m Ins.1 LY Ins.2 MK aktSubstrate
## 1 1.7713032 1.7586017 1.22391675 2.1768015 1.72792046 0
## 2 1.5488467 1.2284801 -1.48212971 1.3697177 -0.34351278 0
## 3 1.1915065 1.0306892 0.19244669 0.3847130 0.33529916 0
## 4 1.3545924 1.8353582 1.11962178 1.9947969 2.10989541 0
## 5 1.6890566 1.1304697 -0.32769234 0.9984844 -0.06444973 0
## 6 0.9581258 2.6999513 -2.04114990 1.0310333 0.94417969 0
## 7 -0.9622192 -0.1700383 0.17638227 -0.9213689 -0.77147990 0
## 8 0.3054039 0.2788174 -2.44748521 0.4462754 -0.21536788 0
## 9 1.3729952 0.8096614 -1.21943792 1.0433781 0.48140228 0
## 10 0.3949470 0.8233308 -3.25529820 0.6229304 0.28226187 0
## 11 0.8839371 0.8063951 -2.80701057 -0.2063343 -0.40697727 0
## 12 1.0297861 0.7931051 -1.64855258 2.1314578 -0.28510830 0
## 13 2.1041683 1.9955565 -1.98625969 1.8308322 0.57329966 0
## 14 2.8004941 2.6139795 1.59579007 2.9494317 2.68473505 0
## 15 2.2894517 2.6095269 -0.84347478 2.3496141 -0.63037407 0
## 16 2.4977519 2.5119487 -1.26711434 2.2076210 0.49737663 0
## 17 3.5927488 0.1467855 0.03266511 3.7977174 2.91975080 0
## 18 1.9369309 1.8985180 -2.59797642 1.7865717 -0.43711530 0
## 19 2.1077629 1.7193617 0.59199044 2.2427720 0.13333731 0
## 20 1.3294921 1.2082051 0.49251979 1.1405824 1.04180369 0
## 21 0.3362205 0.7923556 -0.36974365 0.2824731 0.05479087 0
## 22 4.3194933 3.0319245 -5.77540647 3.7017708 0.92052198 0
## 23 2.7206302 1.1003722 3.11909052 2.3925158 4.26581642 0
## 24 4.3558920 3.7749452 -2.88683294 3.1002713 0.15317365 0
## 25 2.2851311 0.9178701 1.34152331 1.6354072 1.40732826 0
## 26 2.8607429 1.9964985 -1.27744990 1.7056683 -0.44742619 0
## mTORSubstrate
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## 11 1
## 12 1
## 13 1
## 14 1
## 15 1
## 16 1
## 17 1
## 18 1
## 19 1
## 20 1
## 21 1
## 22 1
## 23 1
## 24 1
## 25 1
## 26 1
dat$mTORSubstrate <- as.factor(dat$mTORSubstrate)
# Check for NAs in the master dataset, to identify whether imputation or field/record removal is required
sum(is.na(dat)) # No NA's in dataset. No imputation or data removal required.
## [1] 0
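Besides missingness, the class balance of the labels matters for the modelling step. A quick sketch of the check, using hypothetical labels that mirror the report's counts (22 Akt positives among 12,062 sites; on the real data this would be `table(dat$aktSubstrate)`):

```r
# Hypothetical labels illustrating the class-balance check; the real data
# have 22 Akt and 26 mTOR positives against 12,062 phosphosites.
labels <- factor(c(rep(0, 12040), rep(1, 22)))

table(labels)                         # raw counts per class
round(prop.table(table(labels)), 4)   # positives are a tiny fraction
```

With positives this rare, plain accuracy is uninformative and techniques such as the sourced AdaSampling functions become relevant.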
A series of visualisations was created to better understand the data:
a) Visualise the data to understand patterns between the features and identify potential new features.
b) Visualise the log2 fold-change time-series profiles of insulin-stimulated phosphorylation.
c) Scatter plots
d) Violin plots
e) Kernel density plots
# Create a dataframe of the time-series data, renaming the time columns to allow for ordering.
dat.timeseries <- dat %>%
  dplyr::rename("1" = "X15s", "2" = "X30s", "3" = "X1m", "4" = "X2m",
                "5" = "X5m", "6" = "X10m", "7" = "X20m", "8" = "X60m")
# Use melt to stack all the time-series data into a single column for plotting. Order dataframe by Identifier.
my.result <- melt(dat.timeseries, id = c("Identifier","Seq.Window","Avg.Fold", "AUC","Ins.1","LY","Ins.2","MK","aktSubstrate","mTORSubstrate"))
my.result <- my.result[order(my.result$Identifier),]
# The time-series profiles start at different y values. Adjust all y values to start at zero. This allows for an "apples to apples comparison" of the time-series.
my.result.var1 <- my.result %>% filter(variable == 1) %>% dplyr::select(Identifier, value) # Namespace select() explicitly; it is masked by several attached packages.
my.result2 <- my.result %>% merge(my.result.var1, by="Identifier", all.x=TRUE) %>% mutate(value.offset = value.x - value.y)
Plot the time-series response curves for Akt and mTOR to examine differences in their profiles.
ggplot(my.result2 %>% filter(aktSubstrate==1 | mTORSubstrate == 1))+
geom_line(aes(x=variable, y=value.offset, colour = aktSubstrate, group=Identifier))
A notable difference can be seen between the Akt (aktSubstrate = 1) and mTOR (aktSubstrate = 0) response curves between time segments 1 (15 s) and 6 (10 min). Beyond segment 6, both response curves appear to flatten or decline similarly.
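The early divergence suggests a candidate derived feature: the mean baseline-adjusted response over the early time segments. A minimal sketch on hypothetical data, with one Akt-like and one mTOR-like profile:

```r
# Sketch of a candidate feature implied by the plot: mean early response
# (segments 2-6) after offsetting each curve by its starting value.
# The two rows are hypothetical Akt-like and mTOR-like profiles.
toy <- data.frame(t1 = c(0.1, 0.0), t2 = c(1.2, -0.1), t3 = c(1.4, 0.0),
                  t4 = c(1.5, 0.2), t5 = c(1.4, 1.1), t6 = c(1.4, 1.5))

early.response <- rowMeans(toy[, 2:6] - toy$t1)  # subtract each curve's start
early.response                                   # larger for the Akt-like row
```

A feature like this could be appended to the master dataset alongside the raw time points before model fitting.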
Re-plot the above graph using the mean response for Akt and mTOR, with error bars of one standard deviation.
# Use group-by to combine Akt and mTOR samples at each time interval and calculate the mean and standard deviation.
my.result2.grouped <- my.result2 %>% group_by(aktSubstrate, mTORSubstrate, variable) %>% dplyr::summarize(value.offset.mn = mean(value.offset), value.offset.var = var(value.offset)) %>% data.frame %>% mutate(value.offset.sd = sqrt(value.offset.var))
# Filter for rows containing known Akt and mTOR substrates (i.e. remove all unknown rows).
my.result2.grouped
## aktSubstrate mTORSubstrate variable value.offset.mn value.offset.var
## 1 0 0 1 0.000000000 0.0000000
## 2 0 0 2 0.002683066 0.2315073
## 3 0 0 3 0.027998540 0.3605401
## 4 0 0 4 0.072980073 0.4372112
## 5 0 0 5 0.209028927 0.7272481
## 6 0 0 6 0.294750525 0.9976316
## 7 0 0 7 0.310681932 1.0937986
## 8 0 0 8 0.313309799 1.1477497
## 9 0 1 1 0.000000000 0.0000000
## 10 0 1 2 -0.154828787 0.2352914
## 11 0 1 3 -0.009111597 0.3806842
## 12 0 1 4 0.235570723 0.5077636
## 13 0 1 5 1.093867783 1.1842623
## 14 0 1 6 1.536133636 1.0623229
## 15 0 1 7 1.780329361 1.4661913
## 16 0 1 8 1.728650862 1.5166222
## 17 1 0 1 0.000000000 0.0000000
## 18 1 0 2 0.915977114 0.3683177
## 19 1 0 3 1.316928956 0.4384971
## 20 1 0 4 1.512899211 0.5600328
## 21 1 0 5 1.411754051 0.7790154
## 22 1 0 6 1.404033533 0.7688073
## 23 1 0 7 1.315578573 0.9530284
## 24 1 0 8 1.380574848 0.8755800
## value.offset.sd
## 1 0.0000000
## 2 0.4811520
## 3 0.6004499
## 4 0.6612195
## 5 0.8527884
## 6 0.9988151
## 7 1.0458483
## 8 1.0713308
## 9 0.0000000
## 10 0.4850684
## 11 0.6169961
## 12 0.7125753
## 13 1.0882382
## 14 1.0306905
## 15 1.2108639
## 16 1.2315122
## 17 0.0000000
## 18 0.6068918
## 19 0.6621911
## 20 0.7483534
## 21 0.8826185
## 22 0.8768166
## 23 0.9762317
## 24 0.9357243
my.result2.grouped %>% filter(aktSubstrate == 1 | mTORSubstrate == 1)
## aktSubstrate mTORSubstrate variable value.offset.mn value.offset.var
## 1 0 1 1 0.000000000 0.0000000
## 2 0 1 2 -0.154828787 0.2352914
## 3 0 1 3 -0.009111597 0.3806842
## 4 0 1 4 0.235570723 0.5077636
## 5 0 1 5 1.093867783 1.1842623
## 6 0 1 6 1.536133636 1.0623229
## 7 0 1 7 1.780329361 1.4661913
## 8 0 1 8 1.728650862 1.5166222
## 9 1 0 1 0.000000000 0.0000000
## 10 1 0 2 0.915977114 0.3683177
## 11 1 0 3 1.316928956 0.4384971
## 12 1 0 4 1.512899211 0.5600328
## 13 1 0 5 1.411754051 0.7790154
## 14 1 0 6 1.404033533 0.7688073
## 15 1 0 7 1.315578573 0.9530284
## 16 1 0 8 1.380574848 0.8755800
## value.offset.sd
## 1 0.0000000
## 2 0.4850684
## 3 0.6169961
## 4 0.7125753
## 5 1.0882382
## 6 1.0306905
## 7 1.2108639
## 8 1.2315122
## 9 0.0000000
## 10 0.6068918
## 11 0.6621911
## 12 0.7483534
## 13 0.8826185
## 14 0.8768166
## 15 0.9762317
## 16 0.9357243
# Plot mean response including 1 standard deviation bar.
ggplot(my.result2.grouped %>% filter(aktSubstrate==1 | mTORSubstrate == 1))+
geom_point(aes(x=variable, y=value.offset.mn, colour = aktSubstrate, group=aktSubstrate)) +
geom_line(aes(x=variable, y=value.offset.mn, colour = aktSubstrate, group=aktSubstrate)) +
geom_errorbar(width=.4, aes(x=variable,ymin=value.offset.mn-value.offset.sd, ymax=value.offset.mn+value.offset.sd,colour=aktSubstrate))
The first four time segments (1 - 4) show distinctly different responses. Beyond time segment 4 the responses between Akt and mTOR are less distinct.
Use scatter plots to investigate whether the Ins.1, LY, Ins.2 and MK responses exhibit any noticeable differences between Akt and mTOR that may help differentiate them.
# These features are not time-based; use the master dataframe "dat"
cat.dat <- dat %>% mutate(Type = case_when(aktSubstrate==1~"Akt", mTORSubstrate == 1~"mTOR", TRUE~"Other")) %>% select(Identifier, Seq.Window, Avg.Fold, AUC, Ins.1, LY, Ins.2, MK, aktSubstrate, mTORSubstrate, Type)
colnames(cat.dat)
## [1] "Identifier" "Seq.Window" "Avg.Fold" "AUC"
## [5] "Ins.1" "LY" "Ins.2" "MK"
## [9] "aktSubstrate" "mTORSubstrate" "Type"
head(cat.dat)
## Identifier Seq.Window Avg.Fold AUC Ins.1
## 1 100043914;88; GRHERRSSAEQHS 1.0315122 0.2967288 0.6526217
## 2 100043914;89; RHERRSSAEQHSE 1.2160532 0.5279128 0.6112755
## 3 1110018J18RIK;18; CAGRVPSPAGSVT 0.2860675 0.5240430 -0.1988757
## 4 1110037F02RIK;133; SVDRVISHDRDSP -0.4100882 0.6480379 0.1735112
## 5 1110037F02RIK;138; ISHDRDSPPPPPP -0.3664219 0.5002719 0.2292187
## 6 1110037F02RIK;1628; FLSEPSSPGRSKT -0.2148644 0.4370191 0.2987756
## LY Ins.2 MK aktSubstrate mTORSubstrate Type
## 1 0.34193177 2.1570012 2.13792056 0 0 Other
## 2 0.04620434 3.0963462 3.22683218 0 0 Other
## 3 -0.14777344 -0.2610874 -0.02748394 0 0 Other
## 4 -0.57074291 0.3012657 -0.13305247 0 0 Other
## 5 -0.38138545 0.4130244 -0.13305247 0 0 Other
## 6 -0.45838604 0.1086583 0.02536763 0 0 Other
Use cat.dat to check the distributions of the features by substrate type.
Draw scatter plots of Ins.1 and LY to see if the features exhibit any noticeable difference between Akt and mTOR that may help differentiate between them.
sp <- ggplot(cat.dat) +
geom_point(aes(x=Ins.1, y=LY, color=Type)) +
xlab("Ins.1") +
ylab("LY") +
theme_light()
sp
Draw scatter plots of Ins.2 and MK to see if the features exhibit any noticeable difference between Akt and mTOR that may help differentiate between them.
sp <- ggplot(cat.dat) +
geom_point(aes(x=Ins.2, y=MK, color=Type)) +
xlab("Ins.2") +
ylab("MK") +
theme_light()
sp
sp <- ggplot(cat.dat) +
geom_point(aes(x=Ins.2, y=Ins.1, color=Type)) +
xlab("Ins.2") +
ylab("Ins.1") +
theme_light()
sp
require(gridExtra)
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:randomForest':
##
## combine
p1 <- ggplot(cat.dat) +
geom_boxplot(aes(x=Type, y=Avg.Fold, color=Type)) +
xlab("Type") +
ylab("Avg.Fold") +
ggtitle("Box Plot") +
theme_light()
p2 <- ggplot(cat.dat) +
geom_violin(aes(x=Type, y=Avg.Fold, color=Type), scale = "area") +
xlab("Type") +
ylab("Avg.Fold") +
ggtitle("Violin Plot") +
theme_light()
grid.arrange(p1, p2, ncol=1)
p1 <- ggplot(cat.dat) +
geom_boxplot(aes(x=Type, y=AUC, color=Type)) +
xlab("Type") +
ylab("AUC") +
ggtitle("Box Plot") +
theme_light()
p2 <- ggplot(cat.dat) +
geom_violin(aes(x=Type, y=AUC, color=Type), scale = "area") +
xlab("Type") +
ylab("AUC") +
ggtitle("Violin Plot") +
theme_light()
grid.arrange(p1, p2, ncol=1)
Akt and mTOR exhibit a marked difference in AUC, a summary statistic derived from the shape of the Akt and mTOR response curves.
p1 <- ggplot(cat.dat) +
geom_density(aes(x=AUC, color=Type), n=1000, bw=0.025, trim=F) +
scale_x_continuous(name="AUC", breaks=c(0,0.25,0.5,0.75,1.0,1.25)) +
ylab("Density") +
ggtitle("Kernel Density of AUC") +
theme_light()
p1
p1 <- ggplot(cat.dat) +
geom_density(aes(x=Avg.Fold, color=Type), n=1000, trim=F) +
xlab("Avg.Fold") +
ylab("Density") +
ggtitle("Kernel Density of Avg.Fold") +
theme_light()
p1
Four classification models were built:
1. Random Forest.
2. Support Vector Machine (SVM).
3. K-Nearest Neighbours (KNN).
4. Logistic Regression.
The purpose of building four models was to examine which performed best. Two of the models are linear classifiers (SVM and logistic regression) and two are non-linear (random forest and KNN). Feature creation and feature selection were trialed with different models.
Create a baseline Random Forest model for prediction of Akt substrates and mTOR substrates
# Create a subset dataframe that contains samples with known Akt or mTOR (i.e. exclude all unlabeled samples).
# Remove identifier, seq. window and mTOR substrate flag from subset.
known.dat <- dat[dat$aktSubstrate == 1 | dat$mTORSubstrate == 1, c(-1,-2,-18)]
# This subset dataframe is reasonably balanced in terms of Akt (22) and mTOR (26) samples, so there is no need for up- or down-sampling for the baseline.
# Determine the maximum number of predictors that can be used in the random forest
max.predictors <- length(colnames(known.dat)) - 1 # Subtracting one as the last column is the response variable
set.seed(1) # Set seed to randomise the rows included in each fold
folds <- createFolds(known.dat[,"aktSubstrate"], 10) # Create 10 folds for cross-validation; the seed above makes the fold assignment reproducible
# Create a matrix to hold the results from the baseline random forest.
# Trial from 1 to 100 trees (number of rows = 100)
# Trial from 1 to max.predictors randomly sampled candidate variables for each split
gridSearchResults <- matrix(nrow=100, ncol=max.predictors)
for (m in 1:100){
for(k in 1:max.predictors){
fold.accuracy <- NULL
for (tst in folds){
known.dat.train = known.dat[-tst,]
known.dat.test = known.dat[tst,]
rf.Baseline <- randomForest(aktSubstrate~., data=known.dat.train, mtry=k, replace=TRUE, importance=TRUE, ntree=m)
ypred <- predict(rf.Baseline, newdata=known.dat.test)
ypred.accuracy <- sum(ypred == known.dat.test$aktSubstrate)/nrow(known.dat.test)
if (is.null(fold.accuracy)){
fold.accuracy <- ypred.accuracy
}
else {
fold.accuracy <- fold.accuracy + ypred.accuracy
}
#print(ypred.accuracy)
}
avg.accuracy <- fold.accuracy / length(folds)
gridSearchResults[m, k] <- avg.accuracy
}
}
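The grid can then be inspected for the best-performing configuration. A small base-R helper (hypothetical, not part of the original analysis) locates the (ntree, mtry) pair with the highest cross-validated accuracy:

```r
# Return the row (ntree), column (mtry) and value of the grid maximum
best_config <- function(results) {
  idx <- which(results == max(results), arr.ind = TRUE)[1, ]
  c(ntree = unname(idx["row"]), mtry = unname(idx["col"]),
    accuracy = max(results))
}

# Toy grid: accuracy peaks at ntree = 3, mtry = 2
toy <- matrix(c(0.80, 0.85, 0.90,
                0.82, 0.88, 0.95), nrow = 3)
best_config(toy)  # ntree 3, mtry 2, accuracy 0.95
```

When several configurations tie at the maximum, this helper returns the first (smallest ntree), which also favours the cheapest model.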
The baseline random forest model discriminates Akt substrates from mTOR substrates with 10-fold cross-validated accuracy ranging from 81.5% (min) to 100% (max) across the ntree/mtry grid, using the existing 14 features.
summary(gridSearchResults) # Summarise the model results
## V1 V2 V3 V4
## Min. :0.8150 Min. :0.8817 Min. :0.8717 Min. :0.8600
## 1st Qu.:0.9350 1st Qu.:0.9383 1st Qu.:0.9500 1st Qu.:0.9300
## Median :0.9550 Median :0.9550 Median :0.9550 Median :0.9550
## Mean :0.9408 Mean :0.9519 Mean :0.9550 Mean :0.9476
## 3rd Qu.:0.9550 3rd Qu.:0.9550 3rd Qu.:0.9750 3rd Qu.:0.9550
## Max. :0.9750 Max. :1.0000 Max. :0.9833 Max. :0.9750
## V5 V6 V7 V8
## Min. :0.8767 Min. :0.8900 Min. :0.8900 Min. :0.8817
## 1st Qu.:0.9300 1st Qu.:0.9300 1st Qu.:0.9133 1st Qu.:0.9100
## Median :0.9300 Median :0.9300 Median :0.9300 Median :0.9300
## Mean :0.9339 Mean :0.9325 Mean :0.9270 Mean :0.9233
## 3rd Qu.:0.9500 3rd Qu.:0.9500 3rd Qu.:0.9300 3rd Qu.:0.9300
## Max. :0.9750 Max. :0.9750 Max. :0.9750 Max. :0.9550
## V9 V10 V11 V12
## Min. :0.8767 Min. :0.8900 Min. :0.8733 Min. :0.8900
## 1st Qu.:0.9100 1st Qu.:0.9100 1st Qu.:0.9300 1st Qu.:0.9100
## Median :0.9300 Median :0.9300 Median :0.9300 Median :0.9300
## Mean :0.9228 Mean :0.9252 Mean :0.9265 Mean :0.9248
## 3rd Qu.:0.9300 3rd Qu.:0.9300 3rd Qu.:0.9300 3rd Qu.:0.9300
## Max. :0.9600 Max. :0.9550 Max. :0.9750 Max. :0.9750
## V13 V14
## Min. :0.8900 Min. :0.8767
## 1st Qu.:0.9100 1st Qu.:0.9300
## Median :0.9300 Median :0.9300
## Mean :0.9236 Mean :0.9258
## 3rd Qu.:0.9300 3rd Qu.:0.9300
## Max. :0.9300 Max. :0.9500
require(plotly) # plotly provides plot_ly() and add_surface()
p <- plot_ly(z = ~gridSearchResults) %>% add_surface()
p
Create new features for the random forest model to see if performance can be improved. The sequence window column contains the 13-amino-acid sequence surrounding the phosphorylation site, with the phosphorylated residue at position 7. Break out the sequence window into one feature per position.
require(stringr)
cat.dat$Seq.Window <- as.character(cat.dat$Seq.Window)
str_length(cat.dat[1,]$Seq.Window)
## [1] 13
# Create a new column for each of the amino acid positions.
cat.dat$s1 <- as.factor(str_sub( cat.dat$Seq.Window, 1,1))
cat.dat$s2 <- as.factor(str_sub( cat.dat$Seq.Window, 2,2))
cat.dat$s3 <- as.factor(str_sub( cat.dat$Seq.Window, 3,3))
cat.dat$s4 <- as.factor(str_sub( cat.dat$Seq.Window, 4,4))
cat.dat$s5 <- as.factor(str_sub( cat.dat$Seq.Window, 5,5))
cat.dat$s6 <- as.factor(str_sub( cat.dat$Seq.Window, 6,6))
cat.dat$s7 <- as.factor(str_sub( cat.dat$Seq.Window, 7,7))
cat.dat$s8 <- as.factor(str_sub( cat.dat$Seq.Window, 8,8))
cat.dat$s9 <- as.factor(str_sub( cat.dat$Seq.Window, 9,9))
cat.dat$s10 <- as.factor(str_sub( cat.dat$Seq.Window, 10,10))
cat.dat$s11 <- as.factor(str_sub( cat.dat$Seq.Window, 11,11))
cat.dat$s12 <- as.factor(str_sub( cat.dat$Seq.Window, 12,12))
cat.dat$s13 <- as.factor(str_sub( cat.dat$Seq.Window, 13,13))
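The thirteen near-identical str_sub calls above can also be generated in a loop. A base-R sketch using substr (no stringr required), applied here to hypothetical example windows:

```r
seqs <- c("GRHERRSSAEQHS", "RHERRSSAEQHSE")  # example 13-mer windows
# One factor column per position, named s1..s13
pos.features <- as.data.frame(lapply(1:13, function(i)
  factor(substr(seqs, i, i))))
names(pos.features) <- paste0("s", 1:13)
```

In the real data, the same pattern applied to cat.dat$Seq.Window would replace the thirteen repeated assignments.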
Generate a bar plot of the relative amino-acid frequencies at each sequence-window position (s1 to s13).
for (seqName in colnames(cat.dat)[12:24]){
myplot <- ggplot(cat.dat, aes_string(x=seqName, group = "Type")) +
geom_bar(aes(y = ..prop.., fill = Type), stat="count") +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
ggtitle(seqName)+
facet_grid(~Type)
# myplot
plotFilename <- paste("./Plots/seq_",seqName,".png",sep="_")
ggsave(plotFilename, myplot, dpi=320, device="png")
print(paste("saved plot to: ",plotFilename," ...",sep=""))
}
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s1_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s2_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s3_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s4_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s5_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s6_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s7_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s8_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s9_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s10_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s11_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s12_.png ..."
## Saving 7 x 5 in image
## [1] "saved plot to: ./Plots/seq__s13_.png ..."
Some sequence-window positions show distinctive amino-acid profiles (e.g. 45% of s1 residues for Akt substrates are F).
The visual exploration of the response data showed that Akt and mTOR have distinct profiles in the first four time segments. Features have been created to take advantage of this distinguishing attribute, and hopefully improve model performance.
dat.extra <- dat %>% mutate(diffP1 = X30s - X15s,
diffP2 = X1m - X30s,
diffP3 = X2m - X1m,
diffP4 = X5m - X2m,
diffP5 = X10m - X5m,
diffP6 = X20m - X10m,
diffP7 = X60m - X20m,
Type = as.factor(case_when(aktSubstrate==1~"Akt", mTORSubstrate == 1~"mTOR", TRUE~"Other")),
s1 = as.factor(str_sub( Seq.Window, 1,1)),
s2 = as.factor(str_sub( Seq.Window, 2,2)),
s3 = as.factor(str_sub( Seq.Window, 3,3)),
s4 = as.factor(str_sub( Seq.Window, 4,4)),
s5 = as.factor(str_sub( Seq.Window, 5,5)),
s6 = as.factor(str_sub( Seq.Window, 6,6)),
s7 = as.factor(str_sub( Seq.Window, 7,7)),
s8 = as.factor(str_sub( Seq.Window, 8,8)),
s9 = as.factor(str_sub( Seq.Window, 9,9)),
s10 = as.factor(str_sub( Seq.Window, 10,10)),
s11 = as.factor(str_sub( Seq.Window, 11,11)),
s12 = as.factor(str_sub( Seq.Window, 12,12)),
s13 = as.factor(str_sub( Seq.Window, 13,13))
)
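The diffP1 to diffP7 columns are simply first differences of the eight-point time course; for a single profile this is equivalent to base R's diff():

```r
# Hypothetical response profile over the eight time points X15s .. X60m
profile <- c(0.0, 0.9, 1.3, 1.5, 1.4, 1.4, 1.3, 1.4)
diffs <- diff(profile)  # seven successive changes, matching diffP1..diffP7
```

Large positive early differences (as in this sketch) are exactly the Akt-like signature the exploratory plots highlighted.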
Test the influence of the new features on the accuracy of the random forest model by comparing to the baseline accuracy.
# Use an identical procedure to the baseline so the influence of the new features can be compared to the baseline
dat.extra.rf <- dat.extra %>% filter(Type != "Other") %>% select(-Identifier, -Seq.Window, -mTORSubstrate, -Type)
folds <- createFolds(dat.extra.rf$aktSubstrate, 10)
max.predictors.dat.extra <- length(colnames(dat.extra.rf))-1
gridSearchResults.extra <- matrix(nrow=100, ncol=max.predictors.dat.extra)
for (m in 1:100){
for(k in 1:max.predictors.dat.extra){
fold.accuracy <- NULL
for (tst in folds){
dat.extra.rf.train = dat.extra.rf[-tst,]
dat.extra.rf.test = dat.extra.rf[tst,]
rf.ExtraFeatures <- randomForest(aktSubstrate~., data=dat.extra.rf.train, mtry=k, replace=TRUE, importance=TRUE, ntree=m)
ypred.prob <- predict(rf.ExtraFeatures, newdata=dat.extra.rf.test, type="prob")
ypred <- ifelse(ypred.prob[,2] > 0.5, 1, 0)
ypred.accuracy <- sum(ypred == dat.extra.rf.test$aktSubstrate)/nrow(dat.extra.rf.test)
if (is.null(fold.accuracy)){
fold.accuracy <- ypred.accuracy
}
else {
fold.accuracy <- fold.accuracy + ypred.accuracy
}
}
avg.accuracy <- fold.accuracy / length(folds)
avg.accuracy
gridSearchResults.extra[m, k] <- avg.accuracy
}
}
# View(gridSearchResults.extra)
write.csv(gridSearchResults.extra, './Output/gridSearchResults.csv')
# 5 trees and 4 predictors is all that's needed
require(plotly)
p.extra <- plot_ly(z = ~gridSearchResults.extra) %>% add_surface()
p.extra
rf.ExtraFeatures$importance
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## Avg.Fold 0.00000000 0.000000000 0.000000000 0.0000000
## AUC 0.00500000 0.004761905 0.004862745 0.4065116
## X15s 0.00000000 0.000000000 0.000000000 0.0000000
## X30s 0.02229545 0.028321429 0.024428105 1.4381395
## X1m 0.09437671 0.133174603 0.105734554 5.2558140
## X2m 0.00000000 0.000000000 0.000000000 0.0000000
## X5m 0.00000000 0.000000000 0.000000000 0.0000000
## X10m 0.00000000 0.000000000 0.000000000 0.0000000
## X20m 0.00000000 0.000000000 0.000000000 0.0000000
## X60m 0.00000000 0.000000000 0.000000000 0.0000000
## Ins.1 0.00000000 0.000000000 0.000000000 0.0000000
## LY 0.00000000 0.000000000 0.000000000 0.0000000
## Ins.2 0.00000000 0.000000000 0.000000000 0.0000000
## MK 0.00000000 0.000000000 0.000000000 0.0000000
## diffP1 0.00000000 0.000000000 0.000000000 0.0000000
## diffP2 0.00000000 0.000000000 0.000000000 0.0000000
## diffP3 0.00000000 0.000000000 0.000000000 0.0000000
## diffP4 0.00250000 0.004285714 0.003157895 0.2139535
## diffP5 0.00000000 0.000000000 0.000000000 0.0000000
## diffP6 0.00000000 0.000000000 0.000000000 0.0000000
## diffP7 0.00000000 0.000000000 0.000000000 0.0000000
## s1 0.00000000 0.000000000 0.000000000 0.0000000
## s2 0.11496898 0.125702381 0.116270803 7.3841860
## s3 0.00000000 0.000000000 0.000000000 0.0000000
## s4 0.08455808 0.092642857 0.083578302 6.3293023
## s5 0.00000000 0.000000000 0.000000000 0.0000000
## s6 0.00000000 0.000000000 0.000000000 0.0000000
## s7 0.00000000 0.000000000 0.000000000 0.0000000
## s8 0.00000000 0.000000000 0.000000000 0.0000000
## s9 0.00000000 0.000000000 0.000000000 0.0000000
## s10 0.00000000 0.000000000 0.000000000 0.0000000
## s11 0.00000000 0.000000000 0.000000000 0.0000000
## s12 0.00000000 0.000000000 0.000000000 0.0000000
## s13 0.00000000 0.000000000 0.000000000 0.0000000
max(gridSearchResults.extra)
## [1] 1
Below is a summary of the random forest model and proposed next steps.
a) The average out-of-bag (OOB) estimate of error rate decreases from 9.09% to 6.98% with the new features added.
b) Some mtry and number-of-tree configurations achieve 100% cross-validated accuracy.
c) Apply an adaptive sampling procedure to the dataset in order to classify the unlabeled data.
For the SVM model, no feature selection or creation will be used.
Prepare the data for the Akt SVM model
names(dat)[c(1,2,18)]
## [1] "Identifier" "Seq.Window" "mTORSubstrate"
# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1)
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1)
aktSubstrate.dat <- dat[c(-1,-2,-18)]
# keep a copy for comparison later since we overwrite this feature in the line below
aktSubstrate.dat$aktSubstrate_Original = aktSubstrate.dat$aktSubstrate
# Note: as.numeric(as.factor(x)) maps "0" to 1 and "1" to 2, not to 0 and 1 - hence the corrected column below
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate))
aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1
dfCompare <- data.frame(aktSubstrate.dat$aktSubstrate_Original, aktSubstrate.dat$aktSubstrate, aktSubstrate.dat$aktSubstrate_corrected)
names(dfCompare) <- c("Original", "Factor to Numeric Conversion", "Corrected")
# let's take a look at the original field, the numeric conversion, and the corrected numerical conversion.
# Note: The column types are listed under the column names, [factor, double, double]
head(dfCompare)
## Original Factor to Numeric Conversion Corrected
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 0 1 0
## 5 0 1 0
## 6 0 1 0
dim(aktSubstrate.dat)
## [1] 12062 17
aktSubstrate_data.mat <- apply(X = aktSubstrate.dat[,-(15:17)], MARGIN = 2, FUN = as.numeric)
aktSubstrate_data.mat <- aktSubstrate_data.mat[c(aktIdx,mTORIdx),] # Remove the unlabeled items; otherwise the class imbalance is too large for the SVM to train.
aktSubstrate_known.dat<- aktSubstrate_data.mat
attr(aktSubstrate_data.mat, "dimnames")[2]
## [[1]]
## [1] "Avg.Fold" "AUC" "X15s" "X30s" "X1m" "X2m"
## [7] "X5m" "X10m" "X20m" "X60m" "Ins.1" "LY"
## [13] "Ins.2" "MK"
data.cls.truth <- sapply(X = aktSubstrate.dat$aktSubstrate_corrected, FUN = function(x) {ifelse(x=="1", 1, 0)}) # try with the corrected truth class instead
data.cls.truth_akt <- data.cls.truth[c(aktIdx,mTORIdx)] # Need to match the indexed rows
rownames(aktSubstrate_data.mat) <- paste("p", 1:nrow(aktSubstrate_data.mat), sep="_")
#data.cls.truth
require(e1071) # svm() comes from e1071
k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_akt, 10);
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){
model <- svm(aktSubstrate_data.mat[-fold[[i]],], data.cls.truth_akt[-fold[[i]]])
preds <- ifelse(predict(model, aktSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "1"]))
FN <- c(FN, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "0"]))
}
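The F1, Sen, Spe and Acc helpers are assumed to be defined earlier in the report; a minimal base-R sketch consistent with how they are called here (a matrix whose columns are TN, FP, TP, FN) would be:

```r
Sen <- function(m) m[, "TP"] / (m[, "TP"] + m[, "FN"])   # sensitivity (recall)
Spe <- function(m) m[, "TN"] / (m[, "TN"] + m[, "FP"])   # specificity
Acc <- function(m) (m[, "TP"] + m[, "TN"]) / rowSums(m)  # accuracy
F1  <- function(m) {                                     # harmonic mean of precision and recall
  prec <- m[, "TP"] / (m[, "TP"] + m[, "FP"])
  2 * prec * Sen(m) / (prec + Sen(m))
}

# One perfect fold: 3 true negatives, 2 true positives, no errors
m <- cbind(TN = 3, FP = 0, TP = 2, FN = 0)
```

Each helper returns one value per fold (row), so wrapping the result in mean() gives the cross-validated average, as in the calls below.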
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96
Prepare the data for the mTOR SVM model
mTORSubstrate.dat <- dat[c(-1,-2,-17)]
# keep a copy for comparison later since we overwrite this feature in the line below
mTORSubstrate.dat$mTORSubstrate_Original = mTORSubstrate.dat$mTORSubstrate
# Note: as.numeric(as.factor(x)) maps "0" to 1 and "1" to 2, not to 0 and 1 - hence the corrected column below
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate))
mTORSubstrate.dat$mTORSubstrate_corrected = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) - 1
dfCompare <- data.frame(mTORSubstrate.dat$mTORSubstrate_Original, mTORSubstrate.dat$mTORSubstrate, mTORSubstrate.dat$mTORSubstrate_corrected)
names(dfCompare) <- c("Original", "Factor to Numeric Conversion", "Corrected")
# let's take a look at the original field, the numeric conversion, and the corrected numerical conversion.
# Note: The column types are listed under the column names, [factor, double, double]
head(dfCompare)
## Original Factor to Numeric Conversion Corrected
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 0 1 0
## 5 0 1 0
## 6 0 1 0
dim(mTORSubstrate.dat)
## [1] 12062 17
mTORSubstrate_data.mat <- apply(X = mTORSubstrate.dat[,-(15:17)], MARGIN = 2, FUN = as.numeric) # Drop cols 15-17: the label plus the two comparison columns added for dfCompare above.
mTORSubstrate_data.mat <- mTORSubstrate_data.mat[c(aktIdx,mTORIdx),] # Remove the unlabeled items; otherwise the class imbalance is too large for the SVM to train.
mTORSubstrate_known.dat<- mTORSubstrate_data.mat
attr(mTORSubstrate_data.mat, "dimnames")[2]
## [[1]]
## [1] "Avg.Fold" "AUC" "X15s" "X30s" "X1m" "X2m"
## [7] "X5m" "X10m" "X20m" "X60m" "Ins.1" "LY"
## [13] "Ins.2" "MK"
data.cls.truth_mTOR <- sapply(X = mTORSubstrate.dat$mTORSubstrate_corrected, FUN = function(x) {ifelse(x=="1", 1, 0)}) # use the corrected truth class instead
data.cls.truth_mTOR <- data.cls.truth_mTOR[c(aktIdx,mTORIdx)] # Need to match the indexed rows
rownames(mTORSubstrate_data.mat) <- paste("p", 1:nrow(mTORSubstrate_data.mat), sep="_")
#data.cls.truth
k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_mTOR, 10);
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){
model <- svm(mTORSubstrate_data.mat[-fold[[i]],], data.cls.truth_mTOR[-fold[[i]]])
preds <- ifelse(predict(model, mTORSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "1"]))
FN <- c(FN, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96
A KNN model was built using a wrapper method for feature selection, to see whether removing noisy features could improve model performance.
#### KNN Model for aktSubstrate as class:
selectFeature <- function(train, test, cls.train, cls.test, features) {
## identify a feature to be selected
current.best.accuracy <- -Inf
selected.i <- NULL
for(i in 1:ncol(train)) {
current.f <- colnames(train)[i]
if(!current.f %in% features) {
model <- knn(train=train[,c(features, current.f)], test=test[,c(features, current.f)], cl=cls.train, k=9)
test.acc <- sum(model == cls.test) / length(cls.test)
if(test.acc > current.best.accuracy) {
current.best.accuracy <- test.acc
selected.i <- colnames(train)[i]
}
}
}
return(selected.i)
}
set.seed(1)
inTrain <- createDataPartition(data.cls.truth_akt, p = .6)[[1]]
allFeatures <- colnames(aktSubstrate_data.mat)
train <- aktSubstrate_data.mat[ inTrain,]
test <- aktSubstrate_data.mat[-inTrain,]
cls.train <- data.cls.truth_akt[inTrain]
cls.test <- data.cls.truth_akt[-inTrain]
# use correlation to determine the first feature
cls.train.numeric <- cls.train # cls.train is already a numeric 0/1 vector; using it directly keeps the labels aligned with the rows of train
features <- c()
current.best.cor <- 0
for(i in 1:ncol(train)) {
if(current.best.cor < abs(cor(train[,i], cls.train.numeric))) {
current.best.cor <- abs(cor(train[,i], cls.train.numeric))
features <- colnames(train)[i]
}
}
print(features)
## [1] "LY"
for (j in 2:10) {
selected.i <- selectFeature(train, test, cls.train, cls.test, features)
print(selected.i)
# add the best feature from current run
features <- c(features, selected.i)
}
## [1] "X1m"
## [1] "AUC"
## [1] "X15s"
## [1] "Avg.Fold"
## [1] "X30s"
## [1] "X2m"
## [1] "X10m"
## [1] "X60m"
## [1] "Ins.1"
features_selection_aktSubstrate_data.mat <- aktSubstrate_data.mat[,features]
k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_akt, 10);
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){
#model <- svm(features_selection_aktSubstrate_data.mat[-fold[[i]],], data.cls.truth_akt[-fold[[i]]])
preds <- knn (train=features_selection_aktSubstrate_data.mat[-fold[[i]],], test=features_selection_aktSubstrate_data.mat[fold[[i]],], cl=data.cls.truth_akt[-fold[[i]]], k=9)
#preds <- ifelse(predict(model, features_selection_aktSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth_akt[fold[[i]]] == preds)[data.cls.truth_akt[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "1"]))
FN <- c(FN, sum((data.cls.truth_akt[fold[[i]]] != preds)[preds == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96
set.seed(1)
inTrain <- createDataPartition(data.cls.truth_mTOR, p = .6)[[1]]
allFeatures <- colnames(mTORSubstrate_data.mat)
train <- mTORSubstrate_data.mat[ inTrain,]
test <- mTORSubstrate_data.mat[-inTrain,]
cls.train <- data.cls.truth_mTOR[inTrain]
cls.test <- data.cls.truth_mTOR[-inTrain]
# use correlation to determine the first feature
cls.train.numeric <- cls.train # cls.train is already a numeric 0/1 vector; using it directly keeps the labels aligned with the rows of train
features <- c()
current.best.cor <- 0
for(i in 1:ncol(train)) {
if(current.best.cor < abs(cor(train[,i], cls.train.numeric))) {
current.best.cor <- abs(cor(train[,i], cls.train.numeric))
features <- colnames(train)[i]
}
}
print(features)
## [1] "X30s"
for (j in 2:10) {
selected.i <- selectFeature(train, test, cls.train, cls.test, features)
print(selected.i)
# add the best feature from current run
features <- c(features, selected.i)
}
## [1] "Avg.Fold"
## [1] "AUC"
## [1] "X15s"
## [1] "X1m"
## [1] "LY"
## [1] "X2m"
## [1] "X10m"
## [1] "X60m"
## [1] "Ins.1"
features_selection_mTORSubstrate_data.mat <- mTORSubstrate_data.mat[,features]
k <- 10
set.seed(1)
fold <- createFolds(data.cls.truth_mTOR, 10);
# gold standard (original data)
TP <- TN <- FP <- FN <- c()
for(i in 1:length(fold)){
#model <- svm(features_selection_mTORSubstrate_data.mat[-fold[[i]],], data.cls.truth_mTOR[-fold[[i]]])
preds <- knn (train=features_selection_mTORSubstrate_data.mat[-fold[[i]],], test=features_selection_mTORSubstrate_data.mat[fold[[i]],], cl=data.cls.truth_mTOR[-fold[[i]]], k=9)
#preds <- ifelse(predict(model, features_selection_mTORSubstrate_data.mat[fold[[i]],]) > 0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth_mTOR[fold[[i]]] == preds)[data.cls.truth_mTOR[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "1"]))
FN <- c(FN, sum((data.cls.truth_mTOR[fold[[i]]] != preds)[preds == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.96
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 1
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9333333
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.96
Feature selection did not improve the accuracy of the KNN model on the known samples.
The logistic regression model was built using step-wise feature selection, to see whether removing noisy features could improve model performance.
glm.akt<-glm(dat$aktSubstrate~dat$Avg.Fold+dat$AUC+dat$X15s+dat$X30s+dat$X1m+dat$X2m+dat$X5m+dat$X10m+dat$X20m+dat$X60m+dat$Ins.1+dat$LY+dat$Ins.2+dat$MK,data=dat,family = binomial)
summary(glm.akt)
##
## Call:
## glm(formula = dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s +
## dat$X30s + dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m +
## dat$X60m + dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK, family = binomial,
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4773 -0.0288 -0.0146 -0.0075 4.2066
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.707e+00 1.107e+00 -1.542 0.123
## dat$Avg.Fold 1.274e+08 1.252e+08 1.018 0.309
## dat$AUC -1.497e+01 3.214e+00 -4.657 3.21e-06 ***
## dat$X15s -1.593e+07 1.565e+07 -1.018 0.309
## dat$X30s -1.593e+07 1.565e+07 -1.018 0.309
## dat$X1m -1.593e+07 1.565e+07 -1.018 0.309
## dat$X2m -1.593e+07 1.565e+07 -1.018 0.309
## dat$X5m -1.593e+07 1.565e+07 -1.018 0.309
## dat$X10m -1.593e+07 1.565e+07 -1.018 0.309
## dat$X20m -1.593e+07 1.565e+07 -1.018 0.309
## dat$X60m -1.593e+07 1.565e+07 -1.018 0.309
## dat$Ins.1 -2.659e-01 3.362e-01 -0.791 0.429
## dat$LY -4.172e-01 3.064e-01 -1.362 0.173
## dat$Ins.2 3.409e-01 3.236e-01 1.054 0.292
## dat$MK 2.495e-01 2.580e-01 0.967 0.334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 321.46 on 12061 degrees of freedom
## Residual deviance: 175.06 on 12047 degrees of freedom
## AIC: 205.06
##
## Number of Fisher Scoring iterations: 11
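The enormous, identically-sized coefficients on Avg.Fold and the time-point terms (with matching standard errors) are a symptom of near-perfect collinearity, most likely because Avg.Fold is close to a linear combination of the time points. Perfect linear dependence can be exposed in base R with alias(); a toy demonstration with hypothetical data (not the project dataset):

```r
set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
x3 <- x1 + x2                      # x3 is an exact linear combination
fit <- lm(rnorm(20) ~ x1 + x2 + x3)
coef(fit)["x3"]                    # NA: lm drops the exactly aliased term
alias(fit)                         # reports the dependency x3 = x1 + x2
```

With approximate (rather than exact) collinearity, as in the glm above, the terms are retained but their coefficients become huge and unstable; the step-wise selection that follows removes the redundant terms.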
#feature selection to build the model step by step
glm.akt1<-step(glm.akt)
## Start: AIC=205.06
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s +
## dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m +
## dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK
##
## Df Deviance AIC
## - dat$X2m 1 175.40 203.40
## - dat$X60m 1 175.40 203.40
## - dat$X10m 1 175.40 203.40
## - dat$X30s 1 175.40 203.40
## - dat$Avg.Fold 1 175.40 203.40
## - dat$X5m 1 175.40 203.40
## - dat$X1m 1 175.40 203.40
## - dat$X15s 1 175.40 203.40
## - dat$X20m 1 175.40 203.40
## - dat$Ins.1 1 175.62 203.62
## - dat$MK 1 175.97 203.97
## - dat$Ins.2 1 176.21 204.21
## - dat$LY 1 176.80 204.80
## <none> 175.06 205.06
## - dat$AUC 1 204.03 232.03
##
## Step: AIC=203.4
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s +
## dat$X1m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 +
## dat$LY + dat$Ins.2 + dat$MK
##
## Df Deviance AIC
## - dat$X60m 1 175.40 201.40
## - dat$X10m 1 175.52 201.52
## - dat$X30s 1 175.53 201.53
## - dat$Ins.1 1 175.94 201.94
## - dat$X5m 1 176.10 202.10
## - dat$MK 1 176.27 202.27
## - dat$X1m 1 176.37 202.37
## - dat$Ins.2 1 176.54 202.54
## - dat$LY 1 177.15 203.15
## <none> 175.40 203.40
## - dat$Avg.Fold 1 177.46 203.46
## - dat$X15s 1 179.55 205.55
## - dat$X20m 1 180.10 206.10
## - dat$AUC 1 204.24 230.24
##
## Step: AIC=201.4
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s +
## dat$X1m + dat$X5m + dat$X10m + dat$X20m + dat$Ins.1 + dat$LY +
## dat$Ins.2 + dat$MK
##
## Df Deviance AIC
## - dat$X10m 1 175.53 199.53
## - dat$X30s 1 175.59 199.59
## - dat$Ins.1 1 175.95 199.95
## - dat$MK 1 176.29 200.29
## - dat$Ins.2 1 176.54 200.54
## - dat$X5m 1 176.65 200.65
## - dat$X1m 1 177.06 201.06
## - dat$LY 1 177.16 201.16
## <none> 175.40 201.40
## - dat$X20m 1 180.66 204.66
## - dat$Avg.Fold 1 180.87 204.87
## - dat$X15s 1 181.32 205.32
## - dat$AUC 1 204.90 228.90
##
## Step: AIC=199.53
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s +
## dat$X1m + dat$X5m + dat$X20m + dat$Ins.1 + dat$LY + dat$Ins.2 +
## dat$MK
##
## Df Deviance AIC
## - dat$X30s 1 175.62 197.62
## - dat$Ins.1 1 176.07 198.07
## - dat$MK 1 176.35 198.35
## - dat$Ins.2 1 176.71 198.71
## - dat$X5m 1 176.78 198.78
## - dat$X1m 1 177.06 199.06
## - dat$LY 1 177.24 199.24
## <none> 175.53 199.53
## - dat$X20m 1 180.71 202.71
## - dat$Avg.Fold 1 181.69 203.69
## - dat$X15s 1 181.88 203.88
## - dat$AUC 1 205.00 227.00
##
## Step: AIC=197.62
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m +
## dat$X5m + dat$X20m + dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK
##
## Df Deviance AIC
## - dat$Ins.1 1 176.19 196.19
## - dat$MK 1 176.59 196.59
## - dat$Ins.2 1 176.78 196.78
## - dat$X5m 1 176.80 196.80
## - dat$X1m 1 177.14 197.14
## - dat$LY 1 177.29 197.29
## <none> 175.62 197.62
## - dat$X20m 1 180.97 200.97
## - dat$X15s 1 181.97 201.97
## - dat$Avg.Fold 1 182.71 202.71
## - dat$AUC 1 207.84 227.84
##
## Step: AIC=196.19
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m +
## dat$X5m + dat$X20m + dat$LY + dat$Ins.2 + dat$MK
##
## Df Deviance AIC
## - dat$Ins.2 1 176.92 194.92
## - dat$MK 1 176.93 194.93
## - dat$X5m 1 177.44 195.44
## - dat$X1m 1 177.56 195.56
## <none> 176.19 196.19
## - dat$LY 1 178.24 196.24
## - dat$X20m 1 181.45 199.45
## - dat$X15s 1 182.16 200.16
## - dat$Avg.Fold 1 182.81 200.81
## - dat$AUC 1 208.42 226.42
##
## Step: AIC=194.92
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m +
## dat$X5m + dat$X20m + dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$X5m 1 178.15 194.15
## - dat$X1m 1 178.35 194.35
## - dat$MK 1 178.43 194.43
## <none> 176.92 194.92
## - dat$LY 1 178.95 194.95
## - dat$X20m 1 181.99 197.99
## - dat$X15s 1 183.43 199.43
## - dat$Avg.Fold 1 184.19 200.19
## - dat$AUC 1 209.12 225.12
##
## Step: AIC=194.15
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m +
## dat$X20m + dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$X1m 1 178.83 192.83
## - dat$MK 1 179.75 193.75
## - dat$LY 1 179.94 193.94
## <none> 178.15 194.15
## - dat$X20m 1 182.08 196.08
## - dat$X15s 1 184.19 198.19
## - dat$Avg.Fold 1 185.86 199.86
## - dat$AUC 1 209.31 223.31
##
## Step: AIC=192.83
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X20m +
## dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$MK 1 180.47 192.47
## - dat$LY 1 180.51 192.51
## <none> 178.83 192.83
## - dat$X20m 1 182.26 194.26
## - dat$X15s 1 184.43 196.43
## - dat$Avg.Fold 1 193.54 205.54
## - dat$AUC 1 209.46 221.46
##
## Step: AIC=192.47
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X20m +
## dat$LY
##
## Df Deviance AIC
## - dat$LY 1 181.04 191.04
## <none> 180.47 192.47
## - dat$X20m 1 183.41 193.41
## - dat$X15s 1 186.17 196.17
## - dat$Avg.Fold 1 194.14 204.14
## - dat$AUC 1 209.98 219.98
##
## Step: AIC=191.04
## dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X20m
##
## Df Deviance AIC
## <none> 181.04 191.04
## - dat$X20m 1 183.74 191.74
## - dat$X15s 1 187.84 195.84
## - dat$Avg.Fold 1 194.17 202.17
## - dat$AUC 1 210.32 218.32
summary(glm.akt1)
##
## Call:
## glm(formula = dat$aktSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s +
## dat$X20m, family = binomial, data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5329 -0.0325 -0.0177 -0.0098 4.0712
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1601 1.0160 -2.126 0.03350 *
## dat$Avg.Fold 2.4666 0.5944 4.150 3.32e-05 ***
## dat$AUC -13.2065 2.8361 -4.657 3.21e-06 ***
## dat$X15s -0.7095 0.2590 -2.739 0.00616 **
## dat$X20m -0.7859 0.4447 -1.768 0.07714 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 321.46 on 12061 degrees of freedom
## Residual deviance: 181.04 on 12057 degrees of freedom
## AIC: 191.04
##
## Number of Fisher Scoring iterations: 11
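As a sanity check on the stepwise trace above, each AIC value is simply the residual deviance plus twice the number of estimated coefficients:

```r
# AIC = deviance + 2 * (number of estimated coefficients)
# the final model has 5 coefficients (intercept + 4 predictors)
181.04 + 2 * 5   # 191.04, matching the AIC reported above
```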
# interpret the model coefficients
coef(glm.akt1)
## (Intercept) dat$Avg.Fold dat$AUC dat$X15s dat$X20m
## -2.1600802 2.4665739 -13.2065475 -0.7094784 -0.7859271
exp(glm.akt1$coefficients)# odds ratios: multiplicative change in the odds per unit increase in each predictor
## (Intercept) dat$Avg.Fold dat$AUC dat$X15s dat$X20m
## 1.153159e-01 1.178201e+01 1.838524e-06 4.919007e-01 4.556970e-01
exp(confint(glm.akt1))# 95% confidence intervals for the odds ratios
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.490772e-02 8.100763e-01
## dat$Avg.Fold 3.274079e+00 3.474837e+01
## dat$AUC 4.830945e-09 3.428494e-04
## dat$X15s 3.024994e-01 8.339941e-01
## dat$X20m 2.025290e-01 1.174675e+00
# for a single predictor x_j (with the other predictors held at 0), pi = 0.5
# when x_j = -beta_0/beta_j
xp0.5<- -coef(glm.akt1)[1]/coef(glm.akt1)[-1]
xp0.5#return x which makes pi equal to 0.5
# divide-by-4 rule: beta/4 bounds the change in probability per unit change in x
ratio05<-glm.akt1$coefficients*0.25
ratio05
## (Intercept) dat$Avg.Fold dat$AUC dat$X15s dat$X20m
## -0.5400201 0.6166435 -3.3016369 -0.1773696 -0.1964818
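The divide-by-4 rule follows from the logistic derivative dπ/dx = βπ(1 − π), which is maximised at π = 0.5. A minimal numerical check (the slope value here is illustrative, taken close to the Avg.Fold coefficient above):

```r
# logistic derivative d(pi)/dx = beta * pi * (1 - pi), maximal at pi = 0.5
beta  <- 2.47                       # illustrative slope
pi    <- seq(0.01, 0.99, by = 0.01)
slope <- beta * pi * (1 - pi)
max(slope)                          # equals beta/4 = 0.6175, attained at pi = 0.5
```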
# evaluate model fit (pseudo R-squared, residuals, influence)
R2cox<-1-exp((glm.akt1$deviance-glm.akt1$null.deviance)/length(dat$aktSubstrate))
cat("Cox-Snell R2=",R2cox,"\n")
## Cox-Snell R2= 0.01157409
R2nag<-R2cox/(1-exp((-glm.akt1$null.deviance)/length(dat$aktSubstrate)))
R2nag
## [1] 0.440105
cat("Nagelkerke R2=",R2nag,"\n")
## Nagelkerke R2= 0.440105
plot(residuals(glm.akt1))
# influencePlot() comes from the car package
#install.packages('car')
library(car)
influencePlot(glm.akt1)
## StudRes Hat CookD
## 1226 2.3888727 0.0250214253 0.07145550
## 2101 3.4312407 0.0003102064 0.02114883
## 3764 4.1273076 0.0001158346 0.09203907
## 5797 -0.9838078 0.1136121244 0.01633371
## 9115 -0.6180272 0.3295702238 0.02389170
fitted.pi<-fitted(glm.akt1)
ypred<-1*(fitted.pi>0.5)
length(ypred)
## [1] 12062
n<-table(dat$aktSubstrate,ypred)
n
## ypred
## 0 1
## 0 12037 3
## 1 22 0
cat("sensitivity=",n[2,2]/sum(n[2,]),"\n")
## sensitivity= 0
cat("specificity=",n[1,1]/sum(n[1,]),"\n")
## specificity= 0.9997508
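With only 22 positives among 12,062 sites, the 0.5 cut-off gives zero sensitivity. The metrics at a lower threshold can be sketched as follows (the threshold value is illustrative; `fitted.pi` and `dat` are the objects defined above):

```r
# recompute the confusion table at a lower, illustrative threshold
thr   <- 0.01
ypred <- as.integer(fitted.pi > thr)
# fix the factor levels so the table always has both prediction columns
n <- table(truth = dat$aktSubstrate, pred = factor(ypred, levels = c(0, 1)))
cat("sensitivity=", n[2, 2]/sum(n[2, ]), "\n")
cat("specificity=", n[1, 1]/sum(n[1, ]), "\n")
```

Lowering the threshold trades specificity for sensitivity, which matters when the positive class is this rare.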
# fit a logistic regression for mTOR substrates using all the features
glm.mtor<-glm(dat$mTORSubstrate~dat$Avg.Fold+dat$AUC+dat$X15s+dat$X30s+dat$X1m+dat$X2m+dat$X5m+dat$X10m+dat$X20m+dat$X60m+dat$Ins.1+dat$LY+dat$Ins.2+dat$MK,data=dat,family = binomial)
summary(glm.mtor)
##
## Call:
## glm(formula = dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s +
## dat$X30s + dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m +
## dat$X60m + dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK, family = binomial,
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8940 -0.0515 -0.0380 -0.0304 3.7002
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.135e+00 1.265e+00 -6.431 1.27e-10 ***
## dat$Avg.Fold -5.643e+07 5.946e+08 -0.095 0.924
## dat$AUC 1.448e+00 2.263e+00 0.640 0.522
## dat$X15s 7.053e+06 7.432e+07 0.095 0.924
## dat$X30s 7.053e+06 7.432e+07 0.095 0.924
## dat$X1m 7.053e+06 7.432e+07 0.095 0.924
## dat$X2m 7.053e+06 7.432e+07 0.095 0.924
## dat$X5m 7.053e+06 7.432e+07 0.095 0.924
## dat$X10m 7.053e+06 7.432e+07 0.095 0.924
## dat$X20m 7.053e+06 7.432e+07 0.095 0.924
## dat$X60m 7.053e+06 7.432e+07 0.095 0.924
## dat$Ins.1 4.525e-01 2.928e-01 1.546 0.122
## dat$LY -1.042e+00 1.722e-01 -6.052 1.43e-09 ***
## dat$Ins.2 1.217e-01 3.205e-01 0.380 0.704
## dat$MK 4.010e-01 2.664e-01 1.505 0.132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 371.21 on 12061 degrees of freedom
## Residual deviance: 283.01 on 12047 degrees of freedom
## AIC: 313.01
##
## Number of Fisher Scoring iterations: 10
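The time-course coefficients above are all identical (7.053e+06, with identical standard errors), and the Avg.Fold coefficient is almost exactly −8 times that value, which suggests an exact linear dependency among these columns (consistent with Avg.Fold being the mean of the eight time points). One way to check for such a dependency (a diagnostic sketch, assuming the column names used in the model):

```r
# rank-deficiency check: rank < ncol indicates an exact linear dependency
X <- as.matrix(dat[, c("X15s", "X30s", "X1m", "X2m", "X5m",
                       "X10m", "X20m", "X60m", "Avg.Fold")])
qr(X)$rank
ncol(X)
```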
# stepwise feature selection: backward elimination by AIC with step()
glm.mtor1<-step(glm.mtor)
## Start: AIC=313.01
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X30s +
## dat$X1m + dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m +
## dat$Ins.1 + dat$LY + dat$Ins.2 + dat$MK
##
## Df Deviance AIC
## - dat$X30s 1 283.02 311.02
## - dat$X10m 1 283.02 311.02
## - dat$X1m 1 283.02 311.02
## - dat$X2m 1 283.02 311.02
## - dat$X60m 1 283.02 311.02
## - dat$Avg.Fold 1 283.02 311.02
## - dat$X15s 1 283.02 311.02
## - dat$X5m 1 283.02 311.02
## - dat$X20m 1 283.02 311.02
## - dat$Ins.2 1 283.15 311.15
## - dat$AUC 1 283.42 311.42
## <none> 283.01 313.01
## - dat$MK 1 285.26 313.26
## - dat$Ins.1 1 285.59 313.59
## - dat$LY 1 315.08 343.08
##
## Step: AIC=311.02
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m +
## dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 +
## dat$LY + dat$Ins.2 + dat$MK
##
## Df Deviance AIC
## - dat$Ins.2 1 283.16 309.16
## - dat$AUC 1 283.43 309.43
## - dat$X1m 1 283.86 309.86
## - dat$X2m 1 284.11 310.11
## - dat$X10m 1 284.18 310.18
## - dat$X60m 1 284.74 310.74
## <none> 283.02 311.02
## - dat$X15s 1 285.10 311.10
## - dat$MK 1 285.27 311.27
## - dat$Avg.Fold 1 285.42 311.42
## - dat$Ins.1 1 285.62 311.62
## - dat$X5m 1 285.96 311.97
## - dat$X20m 1 286.14 312.14
## - dat$LY 1 315.08 341.08
##
## Step: AIC=309.16
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$AUC + dat$X15s + dat$X1m +
## dat$X2m + dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 +
## dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$AUC 1 283.58 307.58
## - dat$X1m 1 284.04 308.04
## - dat$X2m 1 284.21 308.21
## - dat$X10m 1 284.32 308.32
## - dat$X60m 1 284.95 308.95
## <none> 283.16 309.16
## - dat$X15s 1 285.19 309.19
## - dat$Avg.Fold 1 285.55 309.55
## - dat$X5m 1 286.13 310.13
## - dat$X20m 1 286.41 310.41
## - dat$MK 1 286.45 310.45
## - dat$Ins.1 1 288.55 312.55
## - dat$LY 1 319.14 343.14
##
## Step: AIC=307.58
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X1m + dat$X2m +
## dat$X5m + dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 + dat$LY +
## dat$MK
##
## Df Deviance AIC
## - dat$X1m 1 284.45 306.45
## - dat$X2m 1 284.50 306.50
## - dat$X10m 1 284.59 306.59
## <none> 283.58 307.58
## - dat$X15s 1 285.58 307.58
## - dat$X60m 1 285.59 307.59
## - dat$Avg.Fold 1 285.98 307.98
## - dat$X5m 1 286.38 308.38
## - dat$X20m 1 286.94 308.94
## - dat$MK 1 286.94 308.94
## - dat$Ins.1 1 288.72 310.72
## - dat$LY 1 319.76 341.76
##
## Step: AIC=306.45
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X2m + dat$X5m +
## dat$X10m + dat$X20m + dat$X60m + dat$Ins.1 + dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$X2m 1 284.72 304.72
## - dat$X10m 1 284.77 304.77
## - dat$X15s 1 285.63 305.63
## - dat$X60m 1 285.65 305.65
## <none> 284.45 306.45
## - dat$X5m 1 286.45 306.45
## - dat$Avg.Fold 1 286.55 306.55
## - dat$X20m 1 287.48 307.48
## - dat$MK 1 287.91 307.91
## - dat$Ins.1 1 290.29 310.29
## - dat$LY 1 320.13 340.13
##
## Step: AIC=304.72
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X5m + dat$X10m +
## dat$X20m + dat$X60m + dat$Ins.1 + dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$X10m 1 284.86 302.86
## - dat$X60m 1 285.65 303.65
## - dat$X15s 1 285.68 303.68
## - dat$X5m 1 286.52 304.52
## <none> 284.72 304.72
## - dat$X20m 1 287.55 305.55
## - dat$MK 1 288.11 306.11
## - dat$Avg.Fold 1 288.15 306.15
## - dat$Ins.1 1 290.34 308.34
## - dat$LY 1 320.21 338.21
##
## Step: AIC=302.86
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X15s + dat$X5m + dat$X20m +
## dat$X60m + dat$Ins.1 + dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$X15s 1 285.73 301.73
## - dat$X60m 1 285.77 301.77
## - dat$X5m 1 286.83 302.83
## <none> 284.86 302.86
## - dat$Avg.Fold 1 288.16 304.16
## - dat$X20m 1 288.31 304.31
## - dat$MK 1 288.91 304.91
## - dat$Ins.1 1 291.02 307.02
## - dat$LY 1 321.54 337.54
##
## Step: AIC=301.73
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X5m + dat$X20m + dat$X60m +
## dat$Ins.1 + dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$X60m 1 286.57 300.57
## - dat$X5m 1 286.88 300.88
## <none> 285.73 301.73
## - dat$Avg.Fold 1 288.26 302.26
## - dat$X20m 1 288.41 302.41
## - dat$MK 1 290.07 304.07
## - dat$Ins.1 1 291.40 305.40
## - dat$LY 1 322.38 336.38
##
## Step: AIC=300.57
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X5m + dat$X20m + dat$Ins.1 +
## dat$LY + dat$MK
##
## Df Deviance AIC
## - dat$X5m 1 287.19 299.19
## - dat$Avg.Fold 1 288.31 300.31
## <none> 286.57 300.57
## - dat$X20m 1 291.07 303.07
## - dat$MK 1 291.31 303.31
## - dat$Ins.1 1 295.34 307.34
## - dat$LY 1 324.53 336.53
##
## Step: AIC=299.19
## dat$mTORSubstrate ~ dat$Avg.Fold + dat$X20m + dat$Ins.1 + dat$LY +
## dat$MK
##
## Df Deviance AIC
## - dat$Avg.Fold 1 288.31 298.31
## <none> 287.19 299.19
## - dat$X20m 1 291.43 301.43
## - dat$MK 1 292.15 302.15
## - dat$Ins.1 1 296.11 306.11
## - dat$LY 1 324.79 334.79
##
## Step: AIC=298.31
## dat$mTORSubstrate ~ dat$X20m + dat$Ins.1 + dat$LY + dat$MK
##
## Df Deviance AIC
## <none> 288.31 298.31
## - dat$X20m 1 292.88 300.88
## - dat$MK 1 294.08 302.08
## - dat$Ins.1 1 296.18 304.18
## - dat$LY 1 330.09 338.09
summary(glm.mtor1)
##
## Call:
## glm(formula = dat$mTORSubstrate ~ dat$X20m + dat$Ins.1 + dat$LY +
## dat$MK, family = binomial, data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0801 -0.0524 -0.0400 -0.0338 3.8343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.2917 0.3284 -22.204 < 2e-16 ***
## dat$X20m 0.3612 0.1727 2.091 0.03651 *
## dat$Ins.1 0.5825 0.2140 2.722 0.00649 **
## dat$LY -1.0901 0.1604 -6.797 1.07e-11 ***
## dat$MK 0.5596 0.2230 2.509 0.01211 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 371.21 on 12061 degrees of freedom
## Residual deviance: 288.31 on 12057 degrees of freedom
## AIC: 298.31
##
## Number of Fisher Scoring iterations: 9
# interpret the model coefficients
coef(glm.mtor1)
## (Intercept) dat$X20m dat$Ins.1 dat$LY dat$MK
## -7.2916990 0.3612087 0.5824630 -1.0900726 0.5595865
exp(glm.mtor1$coefficients)# odds ratios: multiplicative change in the odds per unit increase in each predictor
## (Intercept) dat$X20m dat$Ins.1 dat$LY dat$MK
## 0.0006811698 1.4350628751 1.7904427850 0.3361920776 1.7499486853
exp(confint(glm.mtor1))# 95% confidence intervals for the odds ratios
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.0003362352 0.001229124
## dat$X20m 1.0296699001 2.014685022
## dat$Ins.1 1.1809544351 2.689006282
## dat$LY 0.2458088900 0.462603388
## dat$MK 1.1110386882 2.636092506
# for a single predictor x_j (with the other predictors held at 0), pi = 0.5
# when x_j = -beta_0/beta_j
xp0.5<- -coef(glm.mtor1)[1]/coef(glm.mtor1)[-1]
xp0.5#return x which makes pi equal to 0.5
# divide-by-4 rule: beta/4 bounds the change in probability per unit change in x
ratio05<-glm.mtor1$coefficients*0.25
ratio05
## (Intercept) dat$X20m dat$Ins.1 dat$LY dat$MK
## -1.82292475 0.09030217 0.14561574 -0.27251816 0.13989662
# evaluate model fit (pseudo R-squared, residuals, influence)
R2cox<-1-exp((glm.mtor1$deviance-glm.mtor1$null.deviance)/length(dat$mTORSubstrate))
cat("Cox-Snell R2=",R2cox,"\n")
## Cox-Snell R2= 0.006849037
R2nag<-R2cox/(1-exp((-glm.mtor1$null.deviance)/length(dat$mTORSubstrate)))
R2nag
## [1] 0.2259933
cat("Nagelkerke R2=",R2nag,"\n")
## Nagelkerke R2= 0.2259933
plot(residuals(glm.mtor1))
influencePlot(glm.mtor1)
## StudRes Hat CookD
## 636 -0.9516118 0.2045959821 0.03096525
## 3469 3.8729370 0.0001911857 0.05954741
## 4973 -1.1617307 0.1877282726 0.04506566
## 8839 3.0116993 0.0070395095 0.10194369
## 10346 3.6462645 0.0009243629 0.10866495
## 11110 3.6953068 0.0001305619 0.02274960
fitted.pi<-fitted(glm.mtor1)
ypred<-1*(fitted.pi>0.5)
length(ypred)
## [1] 12062
n<-table(dat$mTORSubstrate,ypred)
n
## ypred
## 0 1
## 0 12036 0
## 1 25 1
cat("sensitivity=",n[2,2]/sum(n[2,]),"\n")
## sensitivity= 0.03846154
cat("specificity=",n[1,1]/sum(n[1,]),"\n")
## specificity= 1
Adaptive sampling (AdaSampling) has been applied to the four models. AdaSampling treats the negative set as unlabelled and assigns each unlabelled observation a probability of being positive (e.g. an Akt substrate) or negative. These class probabilities are then fed back into the model of choice as sampling weights for the next training round, ideally reducing the influence of mislabelled negatives and resulting in better model performance.
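The iterative re-weighting loop can be sketched as follows. This is a simplified illustration using logistic regression as the base learner, not the `adaSample()`/`adaSample_SVM()` implementations used below; `adaSample_sketch` and its arguments are hypothetical:

```r
# simplified AdaSampling loop: negatives are treated as unlabelled, and each
# iteration resamples them with weights given by the current class probabilities
adaSample_sketch <- function(Ps, Ns, train, test, iters = 5) {
  w <- rep(1, length(Ns))  # initial weights on the unlabelled (negative) set
  for (i in seq_len(iters)) {
    # resample the unlabelled set, favouring instances likely to be true negatives
    sampledNs <- sample(Ns, length(Ns), replace = TRUE, prob = w)
    d <- data.frame(train[c(Ps, sampledNs), ],
                    cls = rep(c(1, 0), c(length(Ps), length(sampledNs))))
    model <- glm(cls ~ ., family = binomial, data = d)
    # update weights: probability that each unlabelled instance is truly negative
    w <- 1 - predict(model, newdata = train[Ns, ], type = "response")
  }
  predict(model, newdata = test, type = "response")  # P(positive) on the test set
}
```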
# empty vector to collect the results of all models for later comparison
result.final <- c()
dat$Identifier <- NULL
dat$Seq.Window <-NULL
# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1)
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1)
aktSubstrate.dat <- dat[c(-16)]
# note: as.numeric(as.factor(x)) maps the levels to 1 and 2, not 0 and 1
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate))
# subtract 1 to recover 0/1 labels
aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1
dim(aktSubstrate.dat)
## [1] 12062 16
data.cls.truth <- aktSubstrate.dat$aktSubstrate_corrected
aktSubstrate.dat <- aktSubstrate.dat[,-c(15:16)]
dim(aktSubstrate.dat)
## [1] 12062 14
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
train.mat = aktSubstrate.dat[-fold[[i]],]
test.mat = aktSubstrate.dat[fold[[i]],]
cls = data.cls.truth[-fold[[i]]]
# index positive and negative instances
Ps = rownames(train.mat)[which(cls == 1)]
Ns = rownames(train.mat)[which(cls == 0)]
pred.prob = adaSample_SVM(Ps, Ns, train.mat, test.mat, classifier = "svm", cost=1, gamma=0.01)
pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.04680317
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.8966667
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9309837
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.9308546
result.final <- rbind(result.final, c("AdaSampling (svm)", "Akt",
mean(F1(cbind(TN, FP, TP, FN))),
mean(Acc(cbind(TN, FP, TP, FN))),
mean(Sen(cbind(TN, FP, TP, FN))),
mean(Spe(cbind(TN, FP, TP, FN)))))
posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (svm) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (svm) for Akt - Positive Likelihood Ratio = 12.9921003349213"
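Since the positive likelihood ratio LR+ = sensitivity/(1 − specificity) is computed repeatedly below, it could be wrapped in a small helper (a convenience sketch, not part of the original pipeline):

```r
# positive likelihood ratio: how much more likely a positive prediction is
# for a true substrate than for a non-substrate
posLR <- function(sen, spe) sen/(1 - spe)
posLR(0.8966667, 0.9309837)  # ~12.99, matching the AdaSampling (svm) Akt result
```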
mTORSubstrate.dat <- dat[c(-15)]
# note: as.numeric(as.factor(x)) maps the levels to 1 and 2, not 0 and 1
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate))
# subtract 1 to recover 0/1 labels
dim(mTORSubstrate.dat)
## [1] 12062 16
data.cls.truth <- mTORSubstrate.dat$mTORSubstrate_corrected
mTORSubstrate.dat <- mTORSubstrate.dat[,-c(15:16)]
k = 10
set.seed(1)
#data.cls.truth
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
train.mat = mTORSubstrate.dat[-fold[[i]],]
test.mat = mTORSubstrate.dat[fold[[i]],]
cls = data.cls.truth[-fold[[i]]]
# index positive and negative instances
Ps = rownames(train.mat)[which(cls == 1)]
Ns = rownames(train.mat)[which(cls == 0)]
pred.prob = adaSample_SVM(Ps, Ns, train.mat, test.mat, classifier = "svm", cost=1, gamma=0.01)
pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.02001592
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.855
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.8148946
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.8149533
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (svm)", "mTOR",
mean(F1(cbind(TN, FP, TP, FN))),
mean(Acc(cbind(TN, FP, TP, FN))),
mean(Sen(cbind(TN, FP, TP, FN))),
mean(Spe(cbind(TN, FP, TP, FN)))))
posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (svm) for mTOR - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (svm) for mTOR - Positive Likelihood Ratio = 4.61898938993955"
dat$Identifier <- NULL
dat$Seq.Window <-NULL
# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1)
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1)
aktSubstrate.dat <- dat[c(-16)]
# note: as.numeric(as.factor(x)) maps the levels to 1 and 2, not 0 and 1
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate))
# subtract 1 to recover 0/1 labels
aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1
dim(aktSubstrate.dat)
## [1] 12062 16
aktSubstrate.dat[1:2,]
## Avg.Fold AUC X15s X30s X1m X2m X5m
## 1 1.031512 0.2967288 -0.1858515 0.5813547 1.888572 2.101900 1.842677
## 2 1.216053 0.5279128 0.1077492 0.2802415 1.247199 2.027167 2.637666
## X10m X20m X60m Ins.1 LY Ins.2 MK
## 1 2.287874 1.343137 -1.60756467 0.6526217 0.34193177 2.157001 2.137921
## 2 2.374491 1.103292 -0.04938089 0.6112755 0.04620434 3.096346 3.226832
## aktSubstrate aktSubstrate_corrected
## 1 1 0
## 2 1 0
data.cls.truth <- aktSubstrate.dat$aktSubstrate_corrected
aktSubstrate.dat <- aktSubstrate.dat[,-c(15:16)]
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
train.mat = aktSubstrate.dat[-fold[[i]],]
test.mat = aktSubstrate.dat[fold[[i]],]
cls = data.cls.truth[-fold[[i]]]
# index positive and negative instances
Ps = rownames(train.mat)[which(cls == 1)]
Ns = rownames(train.mat)[which(cls == 0)]
pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "knn")
pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.03898407
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.8966667
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.9203549
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.9202451
result.final <- rbind(result.final, c("AdaSampling (knn)", "Akt",
mean(F1(cbind(TN, FP, TP, FN))),
mean(Acc(cbind(TN, FP, TP, FN))),
mean(Sen(cbind(TN, FP, TP, FN))),
mean(Spe(cbind(TN, FP, TP, FN)))))
posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (knn) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (knn) for Akt - Positive Likelihood Ratio = 11.2582823268712"
mTORSubstrate.dat <- dat[c(-15)]
# note: as.numeric(as.factor(x)) maps the levels to 1 and 2, not 0 and 1
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate))
# subtract 1 to recover 0/1 labels
mTORSubstrate.dat$mTORSubstrate_corrected = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) - 1
data.cls.truth <- mTORSubstrate.dat$mTORSubstrate_corrected
mTORSubstrate.dat <- mTORSubstrate.dat[,-c(15:16)]
dim(mTORSubstrate.dat)
## [1] 12062 14
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
train.mat = mTORSubstrate.dat[-fold[[i]],]
test.mat = mTORSubstrate.dat[fold[[i]],]
cls = data.cls.truth[-fold[[i]]]
# index positive and negative instances
Ps = rownames(train.mat)[which(cls == 1)]
Ns = rownames(train.mat)[which(cls == 0)]
pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "knn")
pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.02254541
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.815
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.8487982
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.8486133
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (knn)", "mTOR",
mean(F1(cbind(TN, FP, TP, FN))),
mean(Acc(cbind(TN, FP, TP, FN))),
mean(Sen(cbind(TN, FP, TP, FN))),
mean(Spe(cbind(TN, FP, TP, FN)))))
posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (knn) for mTOR - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (knn) for mTOR - Positive Likelihood Ratio = 5.39014815946263"
# create an index to remember which rows are positively labelled as akt
aktIdx = which(dat$aktSubstrate==1)
# create an index to remember which rows are positively labelled as mTOR
mTORIdx = which(dat$mTORSubstrate==1)
aktSubstrate.dat <- dat[c(-16)]
# note: as.numeric(as.factor(x)) maps the levels to 1 and 2, not 0 and 1
aktSubstrate.dat$aktSubstrate = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate))
# subtract 1 to recover 0/1 labels
aktSubstrate.dat$aktSubstrate_corrected = as.numeric(as.factor(aktSubstrate.dat$aktSubstrate)) - 1
dim(aktSubstrate.dat)
## [1] 12062 16
aktSubstrate.dat[1:2,]
## Avg.Fold AUC X15s X30s X1m X2m X5m
## 1 1.031512 0.2967288 -0.1858515 0.5813547 1.888572 2.101900 1.842677
## 2 1.216053 0.5279128 0.1077492 0.2802415 1.247199 2.027167 2.637666
## X10m X20m X60m Ins.1 LY Ins.2 MK
## 1 2.287874 1.343137 -1.60756467 0.6526217 0.34193177 2.157001 2.137921
## 2 2.374491 1.103292 -0.04938089 0.6112755 0.04620434 3.096346 3.226832
## aktSubstrate aktSubstrate_corrected
## 1 1 0
## 2 1 0
data.cls.truth <- aktSubstrate.dat$aktSubstrate_corrected
aktSubstrate.dat <- aktSubstrate.dat[,-c(15:16)]
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
train.mat = aktSubstrate.dat[-fold[[i]],-13]
test.mat = aktSubstrate.dat[fold[[i]],-13]
cls = data.cls.truth[-fold[[i]]]
# index positive and negative instances
Ps = rownames(train.mat)[which(cls == 1)]
Ns = rownames(train.mat)[which(cls == 0)]
pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "logit")
pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (the two warnings above repeat across the AdaSampling iterations within each
## of the 10 folds; they indicate quasi-complete separation in the logistic fits)
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.03257058
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.955
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.8766783
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.876732
result.final <- rbind(result.final, c("AdaSampling (logit)", "Akt",
mean(F1(cbind(TN, FP, TP, FN))),
mean(Acc(cbind(TN, FP, TP, FN))),
mean(Sen(cbind(TN, FP, TP, FN))),
mean(Spe(cbind(TN, FP, TP, FN)))))
posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (logit) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (logit) for Akt - Positive Likelihood Ratio = 7.74397143509124"
mTORSubstrate.dat <- dat[c(-15)]
# note: as.numeric(as.factor(x)) maps the levels to 1 and 2, not 0 and 1
mTORSubstrate.dat$mTORSubstrate = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate))
# subtract 1 to recover 0/1 labels
mTORSubstrate.dat$mTORSubstrate_corrected = as.numeric(as.factor(mTORSubstrate.dat$mTORSubstrate)) - 1
data.cls.truth <- mTORSubstrate.dat$mTORSubstrate_corrected
mTORSubstrate.dat <- mTORSubstrate.dat[,-c(15:16)]
dim(mTORSubstrate.dat)
## [1] 12062 14
k = 10
set.seed(1)
fold=createFolds(data.cls.truth, k)
TP <- TN <- FP <- FN <- c()
for (i in 1:length(fold)){
train.mat = mTORSubstrate.dat[-fold[[i]],-13]
test.mat = mTORSubstrate.dat[fold[[i]],-13]
cls = data.cls.truth[-fold[[i]]]
# Index positive and negative instances
Ps = rownames(train.mat)[which(cls == 1)]
Ns = rownames(train.mat)[which(cls == 0)]
pred.prob = adaSample(Ps, Ns, train.mat, test.mat, classifier = "logit")
pred = ifelse(pred.prob[, "P"] >0.5, 1, 0)
TP <- c(TP, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "1"]))
TN <- c(TN, sum((data.cls.truth[fold[[i]]] == pred)[data.cls.truth[fold[[i]]] == "0"]))
FP <- c(FP, sum((data.cls.truth[fold[[i]]] != pred)[pred == "1"]))
FN <- c(FN, sum((data.cls.truth[fold[[i]]] != pred)[pred == "0"]))
}
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (these warnings repeat across the cross-validation folds; output truncated)
mean(F1(cbind(TN, FP, TP, FN)))
## [1] 0.01502278
mean(Sen(cbind(TN, FP, TP, FN)))
## [1] 0.8016667
mean(Spe(cbind(TN, FP, TP, FN)))
## [1] 0.7814113
mean(Acc(cbind(TN, FP, TP, FN)))
## [1] 0.7813835
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (logit)", "mToR",
mean(F1(cbind(TN, FP, TP, FN))),
mean(Acc(cbind(TN, FP, TP, FN))),
mean(Sen(cbind(TN, FP, TP, FN))),
mean(Spe(cbind(TN, FP, TP, FN)))))
posLikeRatio = mean(Sen(cbind(TN, FP, TP, FN)))/(1-mean(Spe(cbind(TN, FP, TP, FN))))
paste("AdaSampling (logit) for mToR - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (logit) for mToR - Positive Likelihood Ratio = 3.66746583220171"
Prepare the Akt substrate dataset for AdaSampling (the mTOR substrates will use a different dataset).
dat.extra.ada <- dat.extra %>% select(-Type) # Remove the Type feature (Seq.Window is dropped below).
# Explicitly store the Identifier and Seq.Window in a separate lookup table to reference back later.
rowNameToIdentifierMap <- data.frame(rownames(dat.extra.ada), dat.extra.ada$Identifier, dat.extra.ada$Seq.Window)
names(rowNameToIdentifierMap) <- c("rowName", "Identifier", "Seq.Window")
dat.extra.ada$Seq.Window <- NULL # Drop this factor column
dat.extra.ada$Identifier <- NULL # Drop Identifier (each identifier is unique, so it carries no signal for classification)
# Save the Akt and mTOR class labels to separate vectors for use in adaSample. Positive and negative instances are indexed just before each adaSample call.
dat.extra.ada.akt_cls <- dat.extra.ada$aktSubstrate
dat.extra.ada.mtor_cls <- dat.extra.ada$mTORSubstrate
# Drop the labels from the dat.extra.ada dataframe
dat.extra.ada$aktSubstrate <-NULL
dat.extra.ada$mTORSubstrate <-NULL
colnames(dat.extra.ada)
## [1] "Avg.Fold" "AUC" "X15s" "X30s" "X1m" "X2m"
## [7] "X5m" "X10m" "X20m" "X60m" "Ins.1" "LY"
## [13] "Ins.2" "MK" "diffP1" "diffP2" "diffP3" "diffP4"
## [19] "diffP5" "diffP6" "diffP7" "s1" "s2" "s3"
## [25] "s4" "s5" "s6" "s7" "s8" "s9"
## [31] "s10" "s11" "s12" "s13"
# Index positive and negative instances, this assumes all unlabeled samples are of the negative class
Ps <- rownames(dat.extra.ada)[which(dat.extra.ada.akt_cls == 1)]
Ns <- rownames(dat.extra.ada)[which(dat.extra.ada.akt_cls == 0)]
## Perform AdaSampling on the full dataset (the training and test sets are the same here)
# require(AdaSampling)
pred.prob <- adaSample(Ps, Ns, dat.extra.ada, test=dat.extra.ada, classifier="rf", C=50, mtry=25, ntree=800)
dim(pred.prob)
## [1] 12062 2
head(pred.prob[,])
## N P
## 1 0.156625 0.843375
## 2 0.302300 0.697700
## 3 0.981075 0.018925
## 4 0.984100 0.015900
## 5 0.993950 0.006050
## 6 0.988075 0.011925
sum(dat.extra.ada.akt_cls==1)
## [1] 22
# summary of predictions
pred = ifelse(pred.prob[,"P"] > 0.5, 1, 0)
TP <- sum((dat.extra.ada.akt_cls == pred)[dat.extra.ada.akt_cls == "1"])
TN <- sum((dat.extra.ada.akt_cls == pred)[dat.extra.ada.akt_cls == "0"])
FP <- sum((dat.extra.ada.akt_cls != pred)[pred == "1"])
FN <- sum((dat.extra.ada.akt_cls != pred)[pred == "0"])
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (RF)", "Akt",
F1(cbind(TN, FP, TP, FN)),
Acc(cbind(TN, FP, TP, FN)),
Sen(cbind(TN, FP, TP, FN)),
Spe(cbind(TN, FP, TP, FN))))
#head(dat.extra.ada.akt_cls)
posLikeRatio = Sen(cbind(TN, FP, TP, FN))/(1-Spe(cbind(TN, FP, TP, FN)))
paste("AdaSampling (RF) for Akt - Positive Likelihood Ratio = ",posLikeRatio)
## [1] "AdaSampling (RF) for Akt - Positive Likelihood Ratio = 15.6576665840971"
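The confusion-matrix bookkeeping above (computing TP/TN/FP/FN, the four metric calls, and the rbind into result.final) is repeated for every classifier and substrate. A small helper of the following shape (hypothetical, assuming the F1/Acc/Sen/Spe functions defined earlier in the report) would remove that duplication:

```r
# Hypothetical helper: compare 0/1 predictions against 0/1 truth labels and
# return one row in the result.final format for the given method/substrate.
evalPreds <- function(method, substrate, truth, pred) {
  TP <- sum(truth == 1 & pred == 1)
  TN <- sum(truth == 0 & pred == 0)
  FP <- sum(truth == 0 & pred == 1)
  FN <- sum(truth == 1 & pred == 0)
  m <- cbind(TN, FP, TP, FN)
  c(method, substrate, F1(m), Acc(m), Sen(m), Spe(m))
}
# Usage, equivalent to the rbind above:
# result.final <- rbind(result.final,
#                       evalPreds("AdaSampling (RF)", "Akt", dat.extra.ada.akt_cls, pred))
```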
Ps <- rownames(dat.extra.ada)[which(dat.extra.ada.mtor_cls == 1)]
Ns <- rownames(dat.extra.ada)[which(dat.extra.ada.mtor_cls == 0)]
## Perform AdaSampling on the full dataset (the training and test sets are the same here)
# require(AdaSampling)
pred.prob.mTOR <- adaSample(Ps, Ns, dat.extra.ada, test=dat.extra.ada, classifier="rf", C=50, mtry=32, ntree=100)
dim(pred.prob.mTOR)
## [1] 12062 2
head(pred.prob.mTOR)
## N P
## 1 0.4622 0.5378
## 2 0.4874 0.5126
## 3 0.9012 0.0988
## 4 0.9442 0.0558
## 5 0.9514 0.0486
## 6 0.9406 0.0594
# summary of predictions
pred = ifelse(pred.prob.mTOR[,"P"] > 0.5, 1, 0)
TP <- sum((dat.extra.ada.mtor_cls == pred)[dat.extra.ada.mtor_cls == "1"])
TN <- sum((dat.extra.ada.mtor_cls == pred)[dat.extra.ada.mtor_cls == "0"])
FP <- sum((dat.extra.ada.mtor_cls != pred)[pred == "1"])
FN <- sum((dat.extra.ada.mtor_cls != pred)[pred == "0"])
# add to result.final
result.final <- rbind(result.final,c("AdaSampling (RF)", "mToR",
F1(cbind(TN, FP, TP, FN)),
Acc(cbind(TN, FP, TP, FN)),
Sen(cbind(TN, FP, TP, FN)),
Spe(cbind(TN, FP, TP, FN))))
sum(dat.extra.ada.mtor_cls==1)
## [1] 26
head(dat.extra.ada.mtor_cls)
## [1] 0 0 0 0 0 0
## Levels: 0 1
Compare predictions to those of the 2016 study. The best identified hyperparameter values are used to generate a final AdaSampling prediction on the whole dataset, with random forest as the base classifier inside the AdaSampling method.
# write the results into csv
result.final <- data.frame(result.final)
names(result.final) <- c("Method", "Substrate","F1","Accuracy","Sensitivity", "Specificity")
result.final <- transform(result.final, F1 = as.numeric(levels(F1))[F1],
Accuracy = as.numeric(levels(Accuracy))[Accuracy],
Sensitivity = as.numeric(levels(Sensitivity))[Sensitivity],
Specificity = as.numeric(levels(Specificity))[Specificity])
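The `as.numeric(levels(F1))[F1]` idiom above is the safe way to recover numeric values from a factor: `as.numeric()` applied directly to a factor returns the internal level codes, not the original values. A toy illustration (values are made up):

```r
f <- factor(c("0.25", "0.50", "0.25"))
as.numeric(f)            # internal level codes: 1 2 1
as.numeric(levels(f))[f] # actual values: 0.25 0.50 0.25
```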
write.csv(result.final, './Output/result.final.csv')
# print the results for comparison
result.final
## Method Substrate F1 Accuracy Sensitivity
## 1 AdaSampling (svm) Akt 0.04680317 0.9308546 0.8966667
## 2 AdaSampling (svm) mToR 0.02001592 0.8149533 0.8550000
## 3 AdaSampling (knn) Akt 0.03898407 0.9202451 0.8966667
## 4 AdaSampling (knn) mToR 0.02254541 0.8486133 0.8150000
## 5 AdaSampling (logit) Akt 0.03257058 0.8767320 0.9550000
## 6 AdaSampling (logit) mToR 0.01502278 0.7813835 0.8016667
## 7 AdaSampling (RF) Akt 0.05405405 0.9390648 0.9545455
## 8 AdaSampling (RF) mToR 0.02135231 0.8176090 0.9230769
## Specificity
## 1 0.9309837
## 2 0.8148946
## 3 0.9203549
## 4 0.8487982
## 5 0.8766783
## 6 0.7814113
## 7 0.9390365
## 8 0.8173812
AktFinalPred <- data.frame(pred.prob, dat.extra.ada.akt_cls) %>%
merge(rowNameToIdentifierMap, by.x = "row.names", by.y = "rowName") %>%
arrange(desc(P)) %>% filter(dat.extra.ada.akt_cls == 0)
mTORFinalPred <- data.frame(pred.prob.mTOR, dat.extra.ada.mtor_cls) %>%
merge(rowNameToIdentifierMap, by.x = "row.names", by.y = "rowName") %>%
arrange(desc(P)) %>% filter(dat.extra.ada.mtor_cls == 0)
head(AktFinalPred %>% arrange(Row.names))
## Row.names N P dat.extra.ada.akt_cls Identifier
## 1 1 0.156625 0.843375 0 100043914;88;
## 2 10 0.920175 0.079825 0 1700009N14RIK;135;
## 3 100 0.967825 0.032175 0 A2MR;4525;
## 4 1000 0.859850 0.140150 0 ARHGAP17;575;
## 5 10000 0.960100 0.039900 0 RBM15;20;
## 6 10001 0.967150 0.032850 0 RBM15;211;
## Seq.Window
## 1 GRHERRSSAEQHS
## 2 RKVKAKSIVFHRK
## 3 RHSLASTDEKREL
## 4 LSPGDSSPPKPKD
## 5 PRWRRASPLCETS
## 6 HLSGSGSGDERVA
head(mTORFinalPred %>% arrange(Row.names))
## Row.names N P dat.extra.ada.mtor_cls Identifier
## 1 1 0.4622 0.5378 0 100043914;88;
## 2 10 0.9364 0.0636 0 1700009N14RIK;135;
## 3 100 0.7824 0.2176 0 A2MR;4525;
## 4 1000 0.9340 0.0660 0 ARHGAP17;575;
## 5 10000 0.9640 0.0360 0 RBM15;20;
## 6 10001 0.9450 0.0550 0 RBM15;211;
## Seq.Window
## 1 GRHERRSSAEQHS
## 2 RKVKAKSIVFHRK
## 3 RHSLASTDEKREL
## 4 LSPGDSSPPKPKD
## 5 PRWRRASPLCETS
## 6 HLSGSGSGDERVA
# Combine the results
final.pred.combined <- AktFinalPred %>% rename(AktProb = P) %>% select(Row.names, AktProb) %>% merge(mTORFinalPred %>% rename(mTORProb = P) %>% select(Row.names, mTORProb, Identifier, Seq.Window), by="Row.names") %>%
mutate(AktPred = ifelse(AktProb > mTORProb & AktProb > 0.5, 1,0),
mTORPred = ifelse(mTORProb > AktProb & mTORProb > 0.5, 1,0))
head(final.pred.combined)
## Row.names AktProb mTORProb Identifier Seq.Window AktPred
## 1 1 0.843375 0.5378 100043914;88; GRHERRSSAEQHS 1
## 2 10 0.079825 0.0636 1700009N14RIK;135; RKVKAKSIVFHRK 0
## 3 100 0.032175 0.2176 A2MR;4525; RHSLASTDEKREL 0
## 4 1000 0.140150 0.0660 ARHGAP17;575; LSPGDSSPPKPKD 0
## 5 10000 0.039900 0.0360 RBM15;20; PRWRRASPLCETS 0
## 6 10001 0.032850 0.0550 RBM15;211; HLSGSGSGDERVA 0
## mTORPred
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
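The mutate above assigns each site to at most one kinase: a positive call requires the probability to exceed both 0.5 and the other kinase's probability, so AktPred and mTORPred can never both be 1 for the same site. On toy probabilities (hypothetical values, not from the data):

```r
toy <- data.frame(AktProb = c(0.84, 0.60, 0.40), mTORProb = c(0.54, 0.70, 0.45))
transform(toy,
AktPred = ifelse(AktProb > mTORProb & AktProb > 0.5, 1, 0),
mTORPred = ifelse(mTORProb > AktProb & mTORProb > 0.5, 1, 0))
# row 1 -> Akt call, row 2 -> mTOR call, row 3 -> no call
```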
Prediction2016.akt <- readxl::read_excel("./Data2/Prediction_2016.xlsx", sheet="Akt_prediction")
Prediction2016.akt <- Prediction2016.akt %>% mutate(Identifier=paste(str_to_upper(GeneSymbol),";" ,`Phosphorylation site`, ';', sep=""))
head(Prediction2016.akt)
## # A tibble: 6 x 11
## GeneSymbol `Phosphorylatio~ `Sequence windo~ `Uniprot ID` `IPI ID`
## <chr> <dbl> <chr> <chr> <chr>
## 1 Irs2 303 FRPRSKSQSSGSS B9EJW3 IPI0092~
## 2 Cep131 114 GRRRVASLSKASS B1AXI9 IPI0088~
## 3 Ndrg3 331 ARSRTHSTSSSIG Q9QYF9 IPI0013~
## 4 Flnc 2234 GRERLGSFGSITR Q8VHX6-1 IPI0066~
## 5 Cables1 272 RRCRTLSGSPRPK B2RWU9 IPI0092~
## 6 Rtkn 520 GRPRTFSLDAAPA Q8C6B2-1 IPI0022~
## # ... with 6 more variables: `Full model predict` <dbl>, `Motif
## # predict` <dbl>, `Phosphoproteome predict` <dbl>, Delta <dbl>, `Uniprot
## # database` <chr>, Identifier <chr>
Prediction2016.mtor <- readxl::read_excel("./Data2/Prediction_2016.xlsx", sheet="mTOR_prediction")
Prediction2016.mtor <- Prediction2016.mtor %>% mutate(Identifier=paste(str_to_upper(GeneSymbol),";" ,`Phosphorylation site`, ';', sep=""))
head(Prediction2016.akt)
## # A tibble: 6 x 11
## GeneSymbol `Phosphorylatio~ `Sequence windo~ `Uniprot ID` `IPI ID`
## <chr> <dbl> <chr> <chr> <chr>
## 1 Irs2 303 FRPRSKSQSSGSS B9EJW3 IPI0092~
## 2 Cep131 114 GRRRVASLSKASS B1AXI9 IPI0088~
## 3 Ndrg3 331 ARSRTHSTSSSIG Q9QYF9 IPI0013~
## 4 Flnc 2234 GRERLGSFGSITR Q8VHX6-1 IPI0066~
## 5 Cables1 272 RRCRTLSGSPRPK B2RWU9 IPI0092~
## 6 Rtkn 520 GRPRTFSLDAAPA Q8C6B2-1 IPI0022~
## # ... with 6 more variables: `Full model predict` <dbl>, `Motif
## # predict` <dbl>, `Phosphoproteome predict` <dbl>, Delta <dbl>, `Uniprot
## # database` <chr>, Identifier <chr>
head(Prediction2016.mtor)
## # A tibble: 6 x 11
## GeneSymbol `Phosphorylatio~ `Sequence windo~ `Uniprot ID` `IPI ID`
## <chr> <dbl> <chr> <chr> <chr>
## 1 Ulk2 772 GRVCVGSPPGPGL Q9QY01 IPI0033~
## 2 C2cd5 295 NQTYSFSPSKSYS Q7TPS5-1 IPI0040~
## 3 Ulk2 781 GPGLGSSPPGAEG Q9QY01 IPI0033~
## 4 Stk11ip 444 LETMGSSPLSTTK Q3TAA7 IPI0046~
## 5 Wdr91 257 PASLSQSPRVGFL Q7TMQ7 IPI0033~
## 6 Oxr1 62 RPPSPASPEGPDT Q8C715 IPI0065~
## # ... with 6 more variables: `Full model predict` <dbl>, `Motif
## # predict` <dbl>, `Phosphoproteome predict` <dbl>, Delta <dbl>, `Uniprot
## # database` <chr>, Identifier <chr>
dim(Prediction2016.akt)
## [1] 12267 11
dim(Prediction2016.mtor)
## [1] 12263 11
p.Sen <- ggplot(result.final) +
geom_col(aes(x=Method, y=Sensitivity, color=Method, fill=Method)) +
geom_text(data=result.final,aes(x=Method,y=Sensitivity, label=as.character(round(Sensitivity,2))),vjust=1.2, size=3) +
xlab("Method") +
ylab("Avg. Sensitivity over 10 Folds") +
scale_y_continuous(labels = percent) +
facet_wrap(~Substrate) +
ggtitle("AdaSample Classifiers - Sensitivity ") +
theme_light() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
p.Sen
The plot above shows the average sensitivity of the AdaSampling classifiers, estimated over 10 folds of cross-validation (the RF figures were computed from a single fit on the full dataset rather than cross-validated, so they are not directly comparable). Among the cross-validated methods, the logit base classifier achieves the highest sensitivity for Akt (0.96), while SVM performs best for mTOR (0.86). Note: this sensitivity measure is based only on the known Akt and mTOR labels; potential unlabelled positives are unaccounted for in these statistics.
p.Spe <- ggplot(result.final) +
geom_col(aes(x=Method, y=Specificity, color=Method, fill=Method)) +
geom_text(data=result.final,aes(x=Method,y=Specificity, label=as.character(round(Specificity,2))),vjust=1.2, size=3) +
xlab("Method") +
ylab("Avg. Specificity over 10 Folds") +
scale_y_continuous(labels = percent) +
facet_wrap(~Substrate) +
ggtitle("AdaSample Classifiers - Specificity (lower bd) ") +
theme_light() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
p.Spe
The plot above shows the average specificity of the AdaSampling classifiers, estimated over 10 folds of cross-validation (the RF figures again come from a single fit on the full dataset). For Akt, RF (0.94) and SVM (0.93) give the highest specificity, with only a small margin between them; for mTOR, kNN performs best (0.85). Note: this specificity measure assumes that all unlabelled samples are negative for both Akt and mTOR. It is therefore a 'lower bound' on the true specificity, since the unlabelled set may contain genuine positives.
head(final.pred.combined)
## Row.names AktProb mTORProb Identifier Seq.Window AktPred
## 1 1 0.843375 0.5378 100043914;88; GRHERRSSAEQHS 1
## 2 10 0.079825 0.0636 1700009N14RIK;135; RKVKAKSIVFHRK 0
## 3 100 0.032175 0.2176 A2MR;4525; RHSLASTDEKREL 0
## 4 1000 0.140150 0.0660 ARHGAP17;575; LSPGDSSPPKPKD 0
## 5 10000 0.039900 0.0360 RBM15;20; PRWRRASPLCETS 0
## 6 10001 0.032850 0.0550 RBM15;211; HLSGSGSGDERVA 0
## mTORPred
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
# (1) Probability of each sample being mislabelled; and
write.csv(final.pred.combined, "./Output/Akt_mTOR_Prediction_Probabilities.csv")
# (2) A predicted list of mislabelled samples for Akt and mTOR, respectively.
## (2a - Akt)
write.csv(final.pred.combined %>% filter(AktPred == 1)%>% select(Identifier,Seq.Window, AktProb, AktPred), "./Output/Akt_Prediction_List.csv")
## (2b - mTOR)
write.csv(final.pred.combined %>% filter(mTORPred == 1) %>% select(Identifier,Seq.Window, mTORProb, mTORPred), "./Output/mTOR_Prediction_List.csv")
Note: the two datasets, AktFinalPred and Prediction2016, appear to use different gene symbols and sites, although there is an intersection between the two.
# Akt
CompareFinalPred <- final.pred.combined %>% merge(Prediction2016.akt%>% dplyr::select(Identifier, `Sequence window`, `Full model predict`) %>% mutate(pred.2016.Akt = ifelse(`Full model predict` > 0.5, 1,0)), by.x=c("Identifier","Seq.Window"), by.y=c("Identifier", "Sequence window"), all.x=T, all.y=F) %>% rename(pred.2016.Akt.prob = `Full model predict`)
# mTOR
CompareFinalPred <- CompareFinalPred %>% merge(Prediction2016.mtor%>% dplyr::select(Identifier, `Sequence window`, `Full model predict`) %>% mutate(pred.2016.mtor = ifelse(`Full model predict` > 0.5, 1,0)), by.x=c("Identifier","Seq.Window"), by.y=c("Identifier", "Sequence window"), all.x=T, all.y=F) %>% rename(pred.2016.mTOR.prob = `Full model predict`)
CompareFinalPred$Row.names <- NULL
names(CompareFinalPred)
## [1] "Identifier" "Seq.Window" "AktProb"
## [4] "mTORProb" "AktPred" "mTORPred"
## [7] "pred.2016.Akt.prob" "pred.2016.Akt" "pred.2016.mTOR.prob"
## [10] "pred.2016.mtor"
head(CompareFinalPred)
## Identifier Seq.Window AktProb mTORProb AktPred mTORPred
## 1 100043914;88; GRHERRSSAEQHS 0.843375 0.5378 1 0
## 2 100043914;89; RHERRSSAEQHSE 0.697700 0.5126 1 0
## 3 1110018J18RIK;18; CAGRVPSPAGSVT 0.018925 0.0988 0 0
## 4 1110037F02RIK;133; SVDRVISHDRDSP 0.015900 0.0558 0 0
## 5 1110037F02RIK;138; ISHDRDSPPPPPP 0.006050 0.0486 0 0
## 6 1110037F02RIK;1628; FLSEPSSPGRSKT 0.011925 0.0594 0 0
## pred.2016.Akt.prob pred.2016.Akt pred.2016.mTOR.prob pred.2016.mtor
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 0.03973396 0 0.09465660 0
## 5 0.02154160 0 0.08976856 0
## 6 0.02307574 0 0.14314934 0
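Both merges above are left joins on the compound key (Identifier, Seq.Window); sites without a counterpart in the 2016 sheet receive NA, as the head() output shows. The first merge could equivalently be written with dplyr's left_join (a sketch assuming the same column names):

```r
# Equivalent left join; `by` maps our column names onto the 2016 sheet's names.
CompareFinalPred <- final.pred.combined %>%
left_join(Prediction2016.akt %>%
dplyr::select(Identifier, `Sequence window`, `Full model predict`) %>%
rename(pred.2016.Akt.prob = `Full model predict`) %>%
mutate(pred.2016.Akt = ifelse(pred.2016.Akt.prob > 0.5, 1, 0)),
by = c("Identifier", "Seq.Window" = "Sequence window"))
```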
## CREDIT: https://ragrawal.wordpress.com/2011/05/16/visualizing-confusion-matrix-in-r/
# Code was adapted from URL above.
# Author: Ritesh Agarwal (2011)
#CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.Akt)),"pred.2016.Akt"]
#compute frequency of actual categories
actual.Akt = as.data.frame(table(CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.Akt)),"pred.2016.Akt"]))
names(actual.Akt) = c("Actual","ActualFreq")
#compute confusion matrix
confusion.Akt = as.data.frame(table(CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.Akt)),"pred.2016.Akt"], CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.Akt)),"AktPred"]))
names(confusion.Akt) = c("Actual","Predicted","Freq")
# confusion.Akt
# head(confusion.Akt)
# head(actual.Akt)
#calculate percentage of test cases based on actual frequency
confusion.Akt = merge(confusion.Akt, actual.Akt, by=c("Actual"))
confusion.Akt$Percent = confusion.Akt$Freq/confusion.Akt$ActualFreq*100
#render plot
# we use three different layers
# first we draw tiles and fill color based on percentage of test cases
tile.Akt <- ggplot() +
geom_tile(aes(x=Actual, y=Predicted,fill=Percent),data=confusion.Akt, color="black",size=0.1) +
labs(x="2016 Prediction",y="RF AdaSampling Prediction") +
#scale_x_discrete(limits = c(1,0), position = 'top') +
theme(axis.text.x.top = element_text(angle = 0, vjust=0.5, hjust=0)) +
ggtitle("Akt Prediction comparison to Prediction_2016.xlsx")
tile.Akt = tile.Akt +
geom_text(aes(x=Actual,y=Predicted, label=sprintf("%1.f (%.1f%%)", Freq, Percent)),data=confusion.Akt, size=3, colour="black", nudge_x = +0.0) +
scale_fill_gradientn(colours = rev(heat.colors(10)))
# lastly we draw diagonal tiles. We use alpha = 0 so as not to hide previous layers but use size=0.3 to highlight border
tile.Akt = tile.Akt +
geom_tile(aes(x=Actual,y=Predicted),data=subset(confusion.Akt, as.character(Actual)==as.character(Predicted)), color="black",size=0.3, fill="black", alpha=0)
#render
tile.Akt
From the pseudo confusion matrix above, which compares the predictions of our final model with those of the 2016 study, we can gauge how closely the two approaches agree on Akt substrate calls; as discussed in the conclusions below, agreement is strongest among the negatively labelled sites.
## CREDIT: https://ragrawal.wordpress.com/2011/05/16/visualizing-confusion-matrix-in-r/
# Code was adapted from URL above.
# Author: Ritesh Agarwal (2011)
#CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.mTOR)),"pred.2016.mTOR"]
#compute frequency of actual categories
actual.mTOR = as.data.frame(table(CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.mtor)),"pred.2016.mtor"]))
names(actual.mTOR) = c("Actual","ActualFreq")
#compute confusion matrix
confusion.mTOR = as.data.frame(table(CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.mtor)),"pred.2016.mtor"], CompareFinalPred[which(!is.na(CompareFinalPred$pred.2016.mtor)),"mTORPred"]))
names(confusion.mTOR) = c("Actual","Predicted","Freq")
# confusion.mTOR
# head(confusion.mTOR)
# head(actual.mTOR)
#calculate percentage of test cases based on actual frequency
confusion.mTOR = merge(confusion.mTOR, actual.mTOR, by=c("Actual"))
confusion.mTOR$Percent = confusion.mTOR$Freq/confusion.mTOR$ActualFreq*100
#render plot
# we use three different layers
# first we draw tiles and fill color based on percentage of test cases
tile.mTOR <- ggplot() +
geom_tile(aes(x=Actual, y=Predicted,fill=Percent),data=confusion.mTOR, color="black",size=0.1) +
labs(x="2016 Prediction",y="RF AdaSampling Prediction") +
#scale_x_discrete(limits = c(1,0), breaks=c("1","0"), position = 'top') +
theme(axis.text.x.top = element_text(angle = 0, vjust=0.5, hjust=0)) +
ggtitle("mTOR Prediction comparison to Prediction_2016.xlsx")
tile.mTOR = tile.mTOR +
geom_text(aes(x=Actual,y=Predicted, label=sprintf("%1.f (%.1f%%)", Freq, Percent)),data=confusion.mTOR, size=3, colour="black", nudge_x = +0.0) +
scale_fill_gradientn(colours = rev(heat.colors(10)))
# lastly we draw diagonal tiles. We use alpha = 0 so as not to hide previous layers but use size=0.3 to highlight border
tile.mTOR = tile.mTOR +
geom_tile(aes(x=Actual,y=Predicted),data=subset(confusion.mTOR, as.character(Actual)==as.character(Predicted)), color="black",size=0.3, fill="black", alpha=0)
#render
tile.mTOR
Based on the experimentation and interpretation of the results, AdaSampling with random forest as the base classifier is expected to perform best for identifying novel Akt and mTOR substrates. This model is selected as the final classifier for kinase-substrate prediction for the following reasons:
1) AdaSampling handles the positively labelled data well.
2) Once unlabelled positives are accounted for by (1), random forest is robust to class imbalance.
3) Sensitivity and specificity are higher across all models when AdaSampling is applied than without it.
4) The RF AdaSampling and Prediction 2016 results agree more closely for mTOR substrate predictions than for Akt substrate predictions on positive samples, based on the (1, 1) cell of the confusion matrix.
5) For negative labels, the two sets of predictions are much more closely aligned, based on the (0, 0) cell.
To properly understand the causes of the underlying differences in prediction, the samples where the predictions diverge must be examined in detail. A code-level comparison with the experiments that produced the "Prediction_2016.xlsx" results should also be undertaken to identify where the analyses differ, including feature generation, feature selection, and the classifier parameters used.